
2021 - A space (hosting) odyssey



WARNING: This entire story is just complete nerd shit so prepare for technohell 

Expect more chapters as time goes on, but this is basically the chronicles of why you never want to become host
 

Chapter listing

  1. Obtaining the damn server
  2. The pain of an OS install when your ISO is 6 gigabytes and moves at 3 megabytes a second
  3. The story of how windows (and/or asrock) sucks
  4. Moving everything to a new platform
  5. The first performance test, AKA byond sucks on linux
  6. Getting the game running on windows
  7. How to migrate with minimal downtime
  8. Why emulation sucks and paravirtualisation is the future (EXTRA EXTRA: BYOND sucks)
  9. The woes of getting logs into a nice viewer without killing the CPU
  10. BONUS CHAPTER - Remotely managing a server where you don't have a KVM/IPMI console

Chapter 1 - Obtaining the damn server

 

Actually obtaining the server is no easy feat, as there are 2 key factors that make it difficult to do:

  • Needs very fast single-thread performance
  • Needs to be in NYC for the high-speed connection between US <-> EU

These two factors combined make the server excellent for SS13 hosting, but also horrifically difficult to buy. This kind of server is also perfect for stock trading due to very fast execution speed, and since it's in NYC, it has sub-millisecond latency to the NYSE, allowing trade bots to run at phenomenal speeds. It took about 2 weeks of constant scouring to actually secure one in time. We even had an incident where we were in the process of buying one, and it went out of stock while we were putting in payment. That is how hard to get these things are. Oh, and since it's just my luck, on the day the server was available, I was ill and bedridden to the point where I couldn't even use a PC, so I had to spam poor neca from my phone like "OI! OI! OI! OI! OI! BUY THE SERVER". But hey, we ended up with one that I declared "beefy enough".

 

The end result server has:

  • A Ryzen 5900X CPU
  • 128GB of RAM
  • A 2TB NVMe SSD
  • A 10 gigabit symmetrical internet connection

and is of course, in NYC so we get the good connection. 

After the purchasing hell, I was expecting stuff to be nowhere near as bad as this, but oh boy was this only the beginning. 


Chapter 2 - The pain of an OS install when your ISO is 6 gigabytes and moves at 3 megabytes a second

 

With the server obtained, I was now able to get to work. Most servers have a thing called IPMI, which is basically a web GUI for anything you'd need physical access for, including but not limited to:

  • Hardware status such as fan speeds and temperatures
  • Remote media mounting to install an operating system without a USB stick
  • Power on/off control
  • A remote console with mouse, keyboard and video
    • This operates outside of an OS remote desktop, meaning you can use it even when the OS is locked up, and before the OS has loaded, so you can do things such as change BIOS settings and more

 

Now, to actually install the OS on here, I just had to put it on the virtual media drive and boot the server to that. Simple enough right? Oh lord...

ASRock's IPMI is pretty damn flawed: the virtual media has a max speed of 3 megabytes a second, and the ISO for the OS was about 6 gigabytes. So I had to sit there and babysit the OS install to make sure it installed properly. Oh, and it also has to be reset if your connection drops for even a second. I spent a full 2 hours just getting the OS installed, and the mentors can tell you this involved a lot of screaming.
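
For a sense of scale, the numbers above can be plugged into a quick back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and a perfectly steady 3 MB/s with no dropped connections, so the result is a best case for a single clean attempt, not a measurement.

```python
def transfer_minutes(size_gb: float, speed_mb_per_s: float) -> float:
    """Best-case transfer time in minutes, assuming 1 GB = 1000 MB."""
    seconds = (size_gb * 1000) / speed_mb_per_s
    return seconds / 60


if __name__ == "__main__":
    # 6 GB ISO over a 3 MB/s virtual media link
    print(f"{transfer_minutes(6, 3):.1f} minutes per attempt")
```

That works out to roughly 33 minutes per attempt before any retries, which is how a single dropped connection turns an install into a 2 hour job.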

However, after the screaming, I had my OS (Microsoft Hyper-V Server) installed, and I was ready to start work on things. I made my VMs, did the base security setup, and was ready to start performance testing. However, I was about to run into a major issue.


Chapter 3 - The story of how windows (and/or asrock) sucks

 

So I mentioned before how BYOND needs single threaded speed to be usable at all. This is achieved by using a high end CPU with a high clock speed. Our previous server had an Intel 9900K CPU which regularly turbo boosted to 4.9GHz. This was very impressive and ran the server very well. In my research, I found that the Ryzen 5900X outperforms the 9900K in single thread performance by roughly 20%, so I expected it all to go well. Now here's where the real issue begins.

image.png.76945c19b4aae54ba0a60c35f38f8fff.png

Windows (Hyper-V Server, Windows Server 2016, and Windows Server 2019) does not have the CPU microcode to properly boost this CPU, which meant our new, faster CPU was now being hardcapped at a lower speed, making it slower than the previous CPU.

After about 2 weeks of debugging and testing, I discovered that it was down to Windows and unfixable, which meant I had to use a new platform for everything. Enter Proxmox VE.


Chapter 4 - Moving everything to a new platform

 

With some testing done, I decided to migrate everything to Proxmox VE. This carried 2 advantages:

  • It uses the KVM hypervisor which allows for a lot more tuning than Hyper-V, not to mention overhead so low you can run GTA V with no performance loss (Or 7 entire gaming VMs off of one PC)
  • It would actually allow the CPU to friggin boost

However, this meant I had to go through the hell of chapter 2 again. It took 3 attempts, but I finally had it done after a load of screaming. Upon being landed at a terminal, the very first thing I did was check clocks, and I was so happy to finally see this:

image.png.ac26cd3fbf5065de01ea1fdff187dd74.png
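
For anyone wanting to do the same kind of check on a Linux host, here is a minimal sketch that reads per-core clocks from /proc/cpuinfo. The function names are made up for illustration, and this is not necessarily the tool that produced the screenshot above; it only relies on the "cpu MHz" field that x86 Linux kernels expose.

```python
import re
from pathlib import Path
from typing import List


def parse_core_mhz(cpuinfo_text: str) -> List[float]:
    """Return the 'cpu MHz' value for every logical core listed in the text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.MULTILINE)
    return [float(value) for value in matches]


def read_core_mhz(cpuinfo_path: str = "/proc/cpuinfo") -> List[float]:
    """Read current per-core clocks from a Linux host (empty list if not exposed)."""
    return parse_core_mhz(Path(cpuinfo_path).read_text())
```

Calling max(read_core_mhz()) while something single-threaded is running gives a rough idea of whether the CPU is actually boosting.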

From here, things started to look up, but there was still a long way to go, and many (unknown as of yet to me) problems, but hey, the CPU was turboing and I could start to make VMs again, and do more performance testing and optimisations. 


Chapter 5 - The first performance test, AKA byond sucks on linux

 

My original plan was to run BYOND on linux to get maximum performance and the ability to use paravirtualisation (More in chapter 8). I had set everything up and was ready to go, so I closed the main server and funnelled everyone over to the test one. 

Good god it was horrible. Linux DD (DreamDaemon) does not play nice with our code at all. Movement stutters every 5 seconds and the server just chugs along. Something about it just becomes unbearable to deal with and the entire experience was appalling, with a pretty compelling "This is way worse" vote. Armed with this information, it was time to go back to the drawing board.


Chapter 6 - Getting the game running on windows

 

I wanted to avoid hosting on Windows for many reasons, but most importantly, Windows Update. God I despise Windows Update. What's this? You're running a production server and you have active connections doing shit? Too bad, update time. The VM itself had slept a few times before I switched over to it, which meant I had a lot of policy tweaking to do to make sure that the Windows VM never shut down under any circumstances, to keep the game running. It took some doing, but I got there in the end.

The major disadvantage with Windows is the lack of paravirtualisation (more in chapter 8). This was a slight performance loss at first (it got much worse later), but I bit the bullet and deemed it fine enough to continue working, especially since the Windows performance tests were proving fruitful. Players weren't whining and people were starting to say it was better than before. Things were starting to look up. The only major hurdle was moving stuff such as game logs (runtime_summary from endround) from a Windows host to a Linux host, but that was easily overcome with SMB shares.

For anyone curious, this is how minimal the windows install on the gameserver is

image.thumb.png.633a9c9cb7ce0c61b8591b4d66ac5c16.png


Chapter 7 - How to migrate with minimal downtime

 

The hardest part of any migration is, well, the migration itself. Moving services from A to B is hard enough on its own, and combining that with the stress of downtime and people yelling "HEY AA WHEN IS THE SERVER BACK" doesn't help matters, but there were some ways to save time.

The main part of the game server is the DB itself. That is the one thing that's always active and has to be moved as one block, so the way around that is to stop the main server, snapshot the DB, move it over, and reimport. This itself took time, but that gave me time to do other things, such as preparing webservices (forum/wiki) and DNS for the new boxes.
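
The stop, snapshot, move, reimport sequence can be sketched roughly as below. The post does not say which database engine is in use, so this assumes MySQL/MariaDB purely for illustration; the database name, dump path and host are placeholders, and credentials are left out.

```python
import shlex
from typing import List


def migration_commands(db_name: str, dump_file: str, new_host: str) -> List[str]:
    """Shell commands for a snapshot -> move -> reimport migration.

    Assumes MySQL/MariaDB tooling; every argument is a placeholder.
    """
    db, dump, host = (shlex.quote(x) for x in (db_name, dump_file, new_host))
    return [
        # 1. Snapshot the database once the game server is stopped
        f"mysqldump --single-transaction --routines --databases {db} > {dump}",
        # 2. Move the snapshot over to the new box
        f"scp {dump} {host}:{dump}",
        # 3. Reimport it (run this one on the new box)
        f"mysql < {dump}",
    ]
```

The dump and copy are the slow steps, which is the window where forum, wiki and DNS prep can happen in parallel.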

Migration time is also important. Stats show that our lowpop time is about 10AM UTC

image.thumb.png.108754db2c6a35c4923f14ce66519bbe.png

Which means about 11AM on a Saturday morning. Wonderful. What a wonderful time to be awake. 
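
Picking the migration window from population stats boils down to finding the hour with the lowest average player count. A minimal sketch, assuming the stats are available as (UTC hour, player count) samples; the function name and the numbers in the usage note are made up, not real server data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def quietest_hour(samples: Iterable[Tuple[int, int]]) -> int:
    """Return the UTC hour (0-23) with the lowest average player count."""
    by_hour: Dict[int, List[int]] = defaultdict(list)
    for hour, players in samples:
        by_hour[hour].append(players)
    return min(by_hour, key=lambda h: sum(by_hour[h]) / len(by_hour[h]))
```

With invented samples such as [(9, 40), (10, 12), (10, 18), (11, 30), (20, 95)], this picks hour 10, since its average of 15 players is the lowest.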


Chapter 8 - Why emulation sucks and paravirtualisation is the future (EXTRA EXTRA: BYOND sucks)

 

So. The gameserver was now running on the new hardware. Everything was migrated, and it was all going nicely, apart from the gameserver having some spurious init time rises. 

image.png.0ceee97fb150175b4a9478a2313e3577.png

After a little investigation, I discovered that all of the things that increased were related to operations on the /icon type. Now this is where the "BYOND Sucks" part comes in.

Every time you do an operation on a /icon, it's written to the HDD as cache. Don't ask me why BYOND doesn't use RAM, because that would make a ton more sense. Situations like roundstart have a lot of icon operations at once, which means the HDD gets absolutely thrashed.

image.png.ffb0d888af7990b6da59c1e1471f0f99.png

Spikes here are from mid MC initialize, with SSatoms being the first and SSair being the second (Both have lots of icon operations)

 

Now here's where emulation and paravirtualisation tie into this. When you are adding hardware to a virtual machine, you have 2 choices: you can either emulate it, or paravirtualise it (commonly referred to as VirtIO).

Emulation is a lot slower as it has to pretend to be an actual device (such as an Intel e1000 network card), but since it is faking real hardware, it is much more compatible with other systems. Emulation also requires device emulation to live on a specific thread, and that thread jumps around a lot. In a worst case scenario, the SATA emulation process ends up on the same thread as BYOND, which makes for a really bad time.

VirtIO on the other hand just says to the VM "Ok you're a VM, here's a direct pipe to the storage, go use it". This has much lower overhead (I am talking literally half the time for IO operations), not to mention higher throughput (VirtIO network cards operate at 10 gigabits per second).

image.png.a22cfacfba2afc7149cd6399391facfa.png

 

VirtIO drivers are natively available on Linux; however, since the game was running on Windows, they weren't available by default. To make matters more complicated, Windows does not like it when you change your HDD interface (i.e. going from SATA -> SCSI). The network card was relatively easy: just install drivers, reboot, change NIC, reset IP, done.

For the HDD, however, I had to modify the boot image to preload the VirtIO drivers, then change the boot ID to make it accept, and it was just generally pain all around, not to mention being on time constraints because "AA THE SERVER IS DOWN WHAT ARE YOU DOING". But hey, it got sorted, and as an added benefit of paravirtualisation, I could tell the HDD IO process to live on its own thread outside of everything else, preventing resource bottlenecks.
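
The general idea of keeping IO work off the core the game uses can be illustrated with plain Linux CPU affinity. This is only a sketch of the concept with hypothetical helper names; it is not how the hypervisor itself was configured, which the post does not show.

```python
import os
from typing import Set


def cores_excluding(all_cores: Set[int], reserved: Set[int]) -> Set[int]:
    """Cores a helper process may use once the reserved ones are set aside."""
    allowed = all_cores - reserved
    if not allowed:
        raise ValueError("no cores left after reserving")
    return allowed


def keep_off_reserved_cores(pid: int, reserved: Set[int]) -> None:
    """Pin `pid` to every core it can currently use except the reserved ones (Linux only)."""
    available = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, cores_excluding(available, reserved))
```

For example, keep_off_reserved_cores(0, {0}) would move the calling process off core 0 (a pid of 0 means "this process"), leaving that core free for whatever single-threaded workload needs it.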

 

This had major performance improvements and reduced a lot of roundstart inconsistencies, and just made the server more performant overall. 


Chapter 9 - The woes of getting logs into a nice viewer without killing the CPU

 

So, if you're an admin you already know this, but admins have a fancy webtool thingy to look at logs (courtesy of Denghis and his wizardry). However, you need some way to get the logs into that (in real time, keep in mind). At first I ran the importer on the gameserver, but that caused severe resource issues, which I wasn't having because BYOND needs to be the most performant thing running on that machine.

To try and combat this, I made a shared folder for the logs and mapped it on the logging server. I thought this would be the solution, but the log importer doesn't like network shares and kept throwing errors, crippling the import process.

A lot of tinkering and tuning later, we finally had the log importer running on the gameserver again without nuking the entire CPU. This made admins happy because they could use log webtools again.
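
The post does not say what the tinkering involved. One common way to keep a log shipper cheap on CPU is to poll for new lines in batches with a pause in between, instead of reacting to every single write. A minimal sketch under that assumption, with hypothetical names throughout and no handling of log rotation:

```python
import time
from typing import Callable, List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Read complete lines appended since `offset`; return them and the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline so half-written lines wait for the next poll
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def follow(path: str, handle: Callable[[List[str]], None], poll_seconds: float = 5.0) -> None:
    """Ship new lines in batches, sleeping between polls to stay light on CPU."""
    offset = 0
    while True:
        lines, offset = read_new_lines(path, offset)
        if lines:
            handle(lines)
        time.sleep(poll_seconds)
```

A longer poll_seconds trades log freshness for less CPU time stolen from the game, which is the knob that matters on a box where BYOND has to come first.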


Chapter 10 (BONUS CHAPTER) - Remotely managing a server where you don't have a KVM/IPMI console

 

So as I mentioned in chapter 2, IPMI is super nice as it gives you full hardware-level access without being there. However, the secondary server I bought did not have this as standard. Let me give some context.

As well as the primary server in NYC, we needed another server to act as an offsite backup in case the NYC datacentre burns to the ground or there's a natural disaster or whatever. Because of location (and pricing), I got a backup server in Germany. Germany was great since it's a whole ocean away from NYC, and the way I see it, if there is a natural disaster that affects both NYC and Germany at the same time, paradise isn't my priority, finding a nuclear bunker is. Anyways, the server had the following:

  • An i7-2600 CPU
  • 16GB RAM
  • 2x3TB HDDs
  • 1 Gigabit symmetrical internet

This was plenty for backup resources, and for only £25 a month it was great. However, there is no IPMI, only a "rescue system", which is where the real chaos begins. To install the backup server OS, I had to:

  1. Boot into the rescue system
  2. Get an ISO of the backup server OS
  3. Make a VM in the rescue system that uses the real HDD as a virtual drive
  4. Open the VM window over VNC so I could actually use it and install stuff 
  5. Close the rescue system and restart the host OS, praying that it comes back up

.... and it didn't come back up.
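
Steps 2 to 4 of the list above could look something like this if the rescue system has QEMU available. That is an assumption, as the post does not name the VM tool, and the ISO path and disk device are placeholders.

```python
from typing import List


def rescue_vm_command(iso_path: str, disk_device: str, ram_mb: int = 4096) -> List[str]:
    """QEMU argv that boots an installer ISO with the real disk attached as a raw drive."""
    return [
        "qemu-system-x86_64",
        "-enable-kvm",  # hardware acceleration, if the rescue kernel supports KVM
        "-m", str(ram_mb),
        "-cdrom", iso_path,  # step 2: the installer ISO
        "-drive", f"file={disk_device},format=raw",  # step 3: real HDD as the VM's disk
        "-boot", "d",  # boot from the CD-ROM first
        "-vnc", "127.0.0.1:0",  # step 4: VNC display :0 (TCP 5900), localhost only
    ]
```

Binding VNC to localhost means the installer screen is only reachable through something like an SSH tunnel into the rescue system, rather than being exposed on the public IP.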

Now, the provider does allow you to get a KVM console, but there are some catches:

  • You only get a 3 hour lease on it
  • There's no virtual media or hardware control, just a keyboard, mouse and monitor
  • The compression and input latency is horrific

To give you an idea of what I mean on the third point, have you ever used a keyboard with a full second of latency on keyup and keydown? If you held a key for a split second, it would type 5 of that letter at the other end, which makes entering stuff you can't see (like passwords) very unfun.

Oh, and in regards to the compression, this is a windows login screen

image.thumb.png.e1444ec89d39b925be92489aec5aa5aa.png

However, in a weird turn of events, this was the rescue system with ludicrous compression

image.thumb.png.5a33732ef72505906c054ce06b4e66f3.png

It kinda has an old abandoned terminal aesthetic to it that I actually kinda like. 

 

After an hour of screwing around with horrific latency, I had the system actually boot properly and have its external IP set, which meant I could finally configure the backup server to back up the main server.


Conclusion post

 

This is just a tiny glimpse into the hell of hosting, and why it takes so many hours and why it's so draining on people. This entire thread only covers some of my issues, and doesn't even go into

  • VPN Hell
  • Thread-pinning to avoid resource conflicts
  • Juggling multiple IPs 
  • DNS
  • Webserver mail sending hell
  • DNS
  • Did I mention DNS? 

 

So yeah, this is why the git has been a bit slow. There's been a lot to deal with, and an entire mix of IRL hell too. But hey, at least things are where they are now.


This is why I wish I could give you guys paid vacations.

 

But also then I wouldn't because that would mean something horrible would happen while you were gone.

 

Idk how many times I'll say this, but once again, thank you so much for the amazing job Arc.


Good writeup. More generally, good job with the setup.

Hopefully readers walk away from this with a better understanding of what being host is like.

