The server outage saga

AffectedArc07 · August 30, 2023

Warning - This is going to get incredibly nerdy in places and I apologise. If you want anything clarified, let me know.

Chapter 1 - The warning shots

For the past month or so, the server has been slightly unstable.

As you can see here, there are server Watch Dog warnings about a server crash. This isn't good, but I chalked it up to BYOND for the longest time as that was the status quo, with everything recovering fine afterwards. Little did I know of the storm that was brewing.

Chapter 2 - The first outage

As a lot of you probably remember, we had an outage on the 15th of August 2023.

Before making this announcement, I assumed it was just the game VM itself crashing due to Windows giving up the ghost, with the only real indication of anything worse being that Corn was unable to log in to the management VPN.

(For context - the management VPN is what gives access to all the Paradise virtual machines, similar to a work-from-home VPN you might have at your company). The fact that Corn couldn't even connect to this is what made me go "oh fuck it's bad", especially with the hypervisor just not responding to stop commands. But after 6 hours and a full machine reboot (which it did itself), I assumed it was just a blip, and we moved on. However, this was indeed the calm before the storm.

Chapter 3 - The real catastrophe

Picture this. It's 10AM on a Monday morning (28th August 2023). You're lying in bed, enjoying a bank holiday (for the non-UK people, we have bank holidays on random days in the year, which is basically a government-mandated "ok it's not a national holiday but you don't have to work" day), when you get the following message.

image.png.9895adc2977236166c863577a5a53c77.png

Unbeknownst to Adri, the entire server had just shat itself, hardcore.

A bit of background. Paradise is made up of 8 virtual machines running on the hypervisor (the server that runs the VMs). These consist of:

A router for managing internal IPs and a management VPN (Own VM because pfSense + stability requirements)
The core server with the database, ALICE, and about 10 internal automation tools (Own VM because it's everything else)
The webserver with the wiki, forums, and all the backend stuff (Own VM because it's such a huge attack vector)
The game server itself (Own VM because it needs Windows)
A server dedicated to Elasticsearch + Kibana for log & metric analysis (Own VM so I can give Denghis the keys and leave him to it)
An internal GitLab (Own VM to confine the fuck out of it)
A PRTG runner for network monitoring (Own VM because it needs Windows and it's not going on the gameserver)
A VM for Corn to make his stats thing

Anyway, I proceed to get out of bed and go to log in to the server panel, only to be hit with a page timeout. So my next assumption is "oh it's networking", but then I dig further and realise how fucked we are.

image.png.3f731c67cd7fd004dacb75d9d9b07662.png

For those unfamiliar, a segmentation fault is when a program tries to access memory it isn't allowed to. But that's not all! My next thought is to try reinstalling the Proxmox packages, so we can use the hypervisor again. That's when I got a kernel error trying to merely run apt update.

image.png.de8105cb7705f59855def0e1bce49772.png

So as you can see, it's already running badly. But wait. It gets worse!

I ran the same command (a version check) twice within the same second. One worked, one died instantly. The same also happened with a check script.

image.png.14fd061f610e3443f753f259fe697b31.png

And trying to list VMs again, it worked half the time.

image.png.379f62a86098515fdbe79601deb13c97.png

And above all, straight-up corruption.
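For anyone wanting to reproduce the "same command, different result" check, here is a minimal sketch of the idea: run one command many times and tally how it exits. The helper name `probe` and the stand-in command (the Python interpreter's own `--version`) are assumptions for illustration, not what was actually run on the box. On Linux, a negative return code from `subprocess` means the process was killed by a signal, which is how a segfault shows up.

```python
import subprocess
import sys


def probe(cmd, runs=20):
    """Run cmd repeatedly and tally clean exits, error exits, and signal deaths."""
    tally = {"ok": 0, "failed": 0, "signalled": 0}
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            tally["ok"] += 1
        elif result.returncode < 0:
            # On POSIX, a negative code is the signal number (e.g. -11 for SIGSEGV).
            tally["signalled"] += 1
        else:
            tally["failed"] += 1
    return tally


if __name__ == "__main__":
    # Hypothetical stand-in for the version check mentioned in the post.
    print(probe([sys.executable, "--version"], runs=5))
```

On a healthy machine every run should land in `ok`. A deterministic command that only sometimes dies points away from the command itself and towards the hardware or OS underneath it.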

So at this point, I establish there is something up, and do the logical thing of a memtest. Which it somehow passed.

image.png.c2ff519bc04e8f968e513b1841f2fe7c.png

So my next assumption is "Oh the OS is corrupt somehow, let's move the VMs off to get it fixed". This then unearthed a whole other pile of problems. Paradise has ~1.2TB of data, and all that has to go somewhere if you are wiping the disk, and the only backup target I had on hand that was big enough was a storage share in Germany, so I started backing up VMs to that.

image.png.1969c87d50fddf0d65571e07a7c11143.png

If you do the maths, you can see that this will take a long time. So I then went to provision storage in eastern US, and at this point you can tell I was out of cares to give.

image.png.2b63c046c008b602304132cf0bda0850.png

I needed fast, dependable storage, so I went to Azure. This helped. Considerably.

image.png.34135acee8fd39234671d55b9c36d79d.png
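To put "do the maths" into numbers, here is a rough sketch of the transfer-time arithmetic. Only the ~1.2TB total comes from the post; the three throughput figures are illustrative assumptions, not measured speeds from either storage target, and `transfer_hours` is a made-up helper name.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    total = 1.2 * TB  # approximate total from the post
    # Hypothetical sustained rates, purely for comparison.
    for label, rate in [("1 MB/s", 1 * MB), ("10 MB/s", 10 * MB), ("100 MB/s", 100 * MB)]:
        print(f"{label:>9}: {transfer_hours(total, rate):7.1f} hours")
```

The relationship is linear, so every tenfold improvement in sustained throughput cuts the wait tenfold, which is why the choice of storage target mattered so much here.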

However, we were now 8 hours in, and the "oh god this is bad" mentality was setting in, especially with the server getting more and more unstable by the second.

image.png.ac074da267986387506f19d4194e3895.png

This dragged on, and we started to run low on time, which is when the "I ain't getting this done in a day" mentality set in. However, I had all the VMs backed up, some in Germany, some in the US, and I was ready to reinstall the OS. So I grabbed a fresh copy of Debian, the most stable thing I can get, and booted into the hardware console for the server, ready to reinstall.

That's when the Debian installer segfaulted on me. I rebooted, it got past that, then died in a separate place. The RAM was fine, and neither of these segfaults was at a point where a disk write was required, which finally led me to point a finger at the CPU being fucked. It would line up with everything else. However, at this point it was 1AM and I had work the following day, so I had to throw in the towel and go at it again after that.

Chapter 4 - The next day

I get home from work, and immediately get shopping for another server. There's another 13900K in stock, and I grab it. The OS installs without a hitch, and I notice the board is a new revision compared to our outgoing server (not just a different BIOS, a straight up different hardware revision). The main "oh this one is better" point was that I didn't have to blacklist half the motherboard modules this time.

For context, server motherboards have extra features on them to help with management, such as a virtual monitor+keyboard+mouse so you can control it remote-desktop style without an OS loaded. It's bloody neat. However, to get this board working on the old server, I had to make the OS completely ignore all these modules, otherwise it would hard lock whenever it tried to interface with them, which clearly isn't desirable behaviour.

Anyway, I get Proxmox installed, set up a basic bit of networking so I have a management VPN again, and then get to deploying the VMs. Which turned out to be a very, very slow restore process. Half of the VMs were stuck in Germany at this point, with a restore speed of 1 megabyte a second, so I then went to move them from Germany to US-E to restore them faster. However:

image.png.a6ec70f2d23143eda9a070e56cb344c9.png

This would take 8 hours. I don't have time for that. However, we have more Azure regions.

image.png.b6b7db3820376243fb32447899fdf1a1.png

This helped. Considerably.

image.png.77f3ae74887c9f4fdb14e16f7eb0e999.png

The final step was to get it moved from EU-W to US-E, which can easily be done with AzCopy, going via Microsoft's backbone instead of my own upload speed.

image.png.8b575202d7c1c054e98642ce8e5c38e5.png

Anyway, enough backup talk. By this point it was restoring onto the new box, and I thought we had the light at the end of the tunnel. I even threw up the gameserver before everything else was ready because people had been patient and not bugged me, not to mention the support thrown my way (bless all of you who pitched in). I launched the server, stability checked it, then went to bed.

Chapter 5 - The light?

Day 3, I wake up, spin up the last VMs that I had left to restore overnight, do some quick validation, and all looks fine. No pings of downtime overnight, and I think we are finally sorted, so I leave for work and pretend all is fine.

I get home, and notice that TD is worse off than before. Like, 15% TD with 80 people, which is not a good amount when we can normally handle 150 people with about 1% TD, so I go to investigate clocks, and notice we are running at 3.4GHz. Our target is 5.8GHz, so there's a lot of performance missing here. Fast forward 3 hours of fucking with various power limits both in the OS and the BIOS, and I throw in the towel. I am beat. The server is running, but she isn't running optimally. She's basically a car in limp mode.

Upon investigation, the CPU voltage fluctuates from as high as 1.08V all the way down to 0.72V, which for a 13900KF is very low. With little time to figure this out since I can't change BIOS settings without taking the server down, I bit the bullet and decided we can run on gimped performance for now.
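For the curious, checking clocks on a Linux host can be as simple as reading the `cpu MHz` lines out of `/proc/cpuinfo`. The sketch below is one way to do that and compare against the 5.8GHz target mentioned above; it is not necessarily how the clocks were actually checked here, and the helper names `parse_core_mhz` and `summarise` are made up for the example. The `cpu MHz` field is what x86 Linux normally reports and may be missing on other platforms.

```python
import re
from pathlib import Path

# Matches lines such as "cpu MHz         : 3400.000" in /proc/cpuinfo.
CPU_MHZ_LINE = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)", re.MULTILINE)


def parse_core_mhz(cpuinfo_text):
    """Return the current clock of every logical core, in MHz."""
    return [float(value) for value in CPU_MHZ_LINE.findall(cpuinfo_text)]


def summarise(core_mhz, target_mhz=5800.0):
    """Report the fastest core and how close it is to the target clock."""
    peak = max(core_mhz)
    return {
        "cores": len(core_mhz),
        "peak_ghz": peak / 1000,
        "fraction_of_target": peak / target_mhz,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        clocks = parse_core_mhz(cpuinfo.read_text())
        if clocks:
            print(summarise(clocks))
```

A single reading is only a snapshot, so it is worth sampling while the server is under load before concluding the CPU is stuck at a low clock.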

Chapter 6 - What now?

So we are running, but far from optimally. We are missing out on performance, but it runs, and I am in two minds about whether to investigate it. Intel 14th gen is rumoured to come out around September/October, which lines up well with our upgrade schedule given it's only a month or two away. There's also the option of Ryzen on the table still, with a 7950X being not only affordable but on a better config for us ($200 instead of $300, and with more storage), so I guess we wait for what the tide brings us.

People have also mentioned that on kernel 6.2, the Intel power state governor was modified, which may be the cause for this, since we were on 5.10 before.

I also want to take this opportunity to address some points of criticism that came up during this.

"Why is it down for 24 hours? Why aren't you fixing it?"

Contrary to the stereotype, some of us have jobs, including me. Given I don't get paid for doing anything regarding hosting on Paradise, there's a clear priority here to do actual work over para. Remember, I am a volunteer here, as are the headcoders, the admins, and everyone else. No one is paid here.

"Why don't you have a team of people doing this?"

We don't really have an infrastructure team on this scale because it's never really been required. I have Corn as a subordinate to manage stuff on the VMs themselves, and Denghis to handle the log analytics stack, but that's where 99% of it goes. Hypervisor management never really comes into play, nor does the hardware itself, hence why I never rolled out a team for it, and probably never will.

"Why don't you have a high availability setup?"

Simple - cost. Our server costs $300 a month. We are not doubling our monthly outlay for a box that's going to get maybe a day's use over an entire year.

"Why don't you use cloud?"

This is a point I will actually expand upon. SS13 is not designed for cloud hosting, simple as. Most cloud PaaS offerings are for generic business things (see: Azure Logic Apps), and any PaaS hosting is always for webapps with 80+443 being your only exposed ports. There's also the issue with performance. No cloud provider will outfit machines with 13900K CPUs, it's not viable for a multitude of reasons, and the single-core performance on Xeon (which is what BYOND needs, remember) is laughable compared to a standard Core-SKU CPU.

The other big hitter is bandwidth. We use 5TB of outbound bandwidth a month. On Azure this is $426 a month. On AWS this is $460 a month. And no, that doesn't include the server or any other resources. Cloud makes sense for businesses. We are not a business.
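As a quick sanity check on those numbers, the sketch below works backwards from the quoted monthly figures to an implied per-GB egress rate and compares the egress bill with the $300 server cost mentioned earlier. It only uses figures from this post and assumes 1TB = 1000GB; it does not look up any provider's actual price list, and the helper name `implied_rate_per_gb` is made up for the example.

```python
MONTHLY_EGRESS_TB = 5          # outbound bandwidth quoted in the post
SERVER_COST_USD = 300          # current monthly server cost quoted in the post
QUOTED_EGRESS_USD = {"Azure": 426, "AWS": 460}  # monthly egress figures from the post


def implied_rate_per_gb(monthly_cost_usd, egress_tb, gb_per_tb=1000):
    """Per-GB price implied by a monthly bill (assumes decimal terabytes)."""
    return monthly_cost_usd / (egress_tb * gb_per_tb)


if __name__ == "__main__":
    for provider, cost in QUOTED_EGRESS_USD.items():
        rate = implied_rate_per_gb(cost, MONTHLY_EGRESS_TB)
        ratio = cost / SERVER_COST_USD
        print(f"{provider}: ~${rate:.4f}/GB, {ratio:.2f}x the whole current server bill")
```

In other words, on these figures the bandwidth alone would cost more each month than the entire dedicated box does today.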

Chapter 7 - Closing notes

I just want to take a moment to thank everyone for sticking through this with me. This is undoubtedly the worst outage in Paradise history, and I do feel responsible since it rests on me. However, you lot have been supportive through the entire thing and that has really helped me mentally. Thank you to all of you.

- aa07

EDIT - Forgot to mention

image.png.376698a3d59cd84357a09a41eb94c1a1.png

We've been through three of these boxes now.

AffectedArc07 · May 25

Update I should have posted like a week ago.

The unlucky sods at CM got given our fucked box.

image.png.eace64f7f375c6b2c27059e51770d6ca.png

image.png.86698a8646713f1cd83098028f19c12d.png
