Multi-Gaming Community

SpA technical '08 - '09

Last year was a technically difficult year. We faced several real challenges.

- The network freezes:

First up is the longest-running issue we've suffered so far. Starting in May '07, shortly after our second birthday, our datacenter came under regular DDoS (Distributed Denial of Service) attacks. In the networking world, these attacks are something you have to endure every now and then; all you can do is stop as much of the traffic as possible at your network border and wait until the attack dies out.

Shortly after these attacks took place we started to get a highly annoying issue. At random, our servers seemed to freeze up for several seconds and then suddenly resume all gaming traffic again, leaving you stuck in-game for 1-10 seconds. Of course, when such a freeze was over, you'd find yourself teleported to the other side of the level humping an enemy sentry gun.

At first we thought the problem was somewhere in our own systems. We started to check and double check everything we could, but we could not discover any issues in or on our own equipment. Logically that would mean the network was causing all the harm.

Several times we informed our datacenter host 'Leaseweb' about our problems, but after some research they'd keep coming back to us saying their network was fine and our servers were to blame. Because of this we turned back to our own machines again, testing and running through every possible thing we could, again and again, but nowhere could we find the source of these freezing issues.

After months of enduring these problems, our tech team was expanded with 2 experienced Linux system administrators. In the end, they would turn out to be the key to solving our problems.

Putting our heads together we developed several tools to test out the network through and through. With each test app we wrote, we got the same results: The network was causing the issue.
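
To give a rough idea of what such a test can look like, here is a minimal Python sketch. It is not one of the actual tools, just an illustration of the principle: send a probe packet at a fixed interval, record when each one arrives, and flag every gap between arrivals that is longer than a threshold. The function name `find_stalls`, the 1-second threshold and the sample timestamps are all made up for this example.

```python
def find_stalls(arrival_times, max_gap=1.0):
    """Return (start, duration) for every gap between consecutive
    arrivals that is longer than max_gap seconds."""
    stalls = []
    for previous, current in zip(arrival_times, arrival_times[1:]):
        gap = current - previous
        if gap > max_gap:
            # Nothing arrived for `gap` seconds after `previous`.
            stalls.append((previous, gap))
    return stalls


# Hypothetical arrival times in seconds, with probes sent every 0.5 s.
sample = [0.0, 0.5, 1.0, 1.5, 6.0, 6.5, 7.0]
print(find_stalls(sample))
```

Running the same check from a machine outside the affected network segment at the same time helps separate a stalling network from a stalling server: if only the probes crossing the network show gaps, the server itself is not the one freezing.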

With all this information gathered we contacted Leaseweb again. But for whatever reason they kept telling us our test results were bad and could not be trusted because we had used apps we wrote ourselves (which was basically the only way to prove we were right). After countless hours of calling, mailing and escalating the issue to the manager over there, we finally got an engineer assigned who was willing to assist us. Together with this engineer we created a new test plan. They would place a server of theirs below ours, and we would start monitoring with a tool of their choice.

Luckily, we got the same results on their machine with their tool of choice. You'd think a final conclusion could be drawn quickly from those results. Unfortunately, we were proven wrong. The engineer assisting us was not a network engineer but a Linux engineer, so he in turn had to convince the Leaseweb network team as well (which apparently was a David vs. Goliath fight). Stubborn as they were, they kept denying that the issue was caused by the network.

At a certain point, it came down to us telling Leaseweb that either they seriously looked at the problem and fixed it, or we'd pack our stuff and move elsewhere. Finally we got some good news. After the weekend they would set up a meeting between the engineer helping us, the manager from the support desk and engineers from network operations for one final look at the problem.

In the afternoon I received a call informing us the problems should be over. Imagine the surprised look on our faces!

In the end it turned out that when the DDoS attacks started (some 11(!) months earlier), they had applied some network filter rules that caused an important router to clear its ARP cache every now and then, stalling all network traffic for several seconds...

As closure to this problem we received 6 months of free services and a box of Leaseweb pens and stickers (our members are to blame for that. They'd rather have a pen than a refund :) ).

- The new Mr. Blonde game server:

Last year we held a fundraiser to get money for a new server. We wanted to replace our main game server (an old Opteron 180 dual core) with a new kick-ass system to ensure we could keep on growing. The response was huge (thanks for that!) and within several months we had gathered 2000 euros to buy the new machine.

As it was such an expensive machine, we figured we might as well pay 35 euros extra to have the company assemble and configure the machine properly (BIOS settings and such). In the future we'll save ourselves the 35 euros, as it once again turned out that if you want something done properly, you'd better do it yourself.

When we received the machine we started installing it right away. First we attempted to install Debian Linux on it, but for some strange reason it kept locking up during the install. A test with a Windows OS gave the same results.

As we had bought it pre-assembled and configured, I decided to return the machine right away.
A few days later we could pick up the machine again. They claimed the machine had a faulty keyboard and said they had swapped it.

We made another attempt at installing it, but at boot we were surprised by the following:
Detected: 1 CPU, Intel Xeon

Uhmm... Did we not buy a dual socket machine with 2 CPUs in it? Yes we did. So once again we returned the machine to the shop. We got it back a few days later; this time they claimed a broken mobo, which had been replaced.

Now finally we had a machine which seemed to operate as it should. All hardware got detected AND we could install an OS on it!

After some testing at home the machine looked good enough to be moved into the datacenter. We removed old Mr. Blonde and installed the new machine with high expectations. A nasty surprise was waiting for us.

As soon as one of the TF2 servers filled up, the problems started again. The machine performed worse than the old Opteron 180... It showed an insanely high CPU load and very laggy performance in-game. After doing some research without finding the cause, we decided to go back to the datacenter and put the old machine back, so we could continue testing the new machine in an attempt to locate the issue.

Unable to locate the issue we decided again to return the server to the store. They were also unable to find any issue and claimed that there was no problem going on.

Back at base we decided to contact Tyan, the manufacturer of the motherboards and cases we use for our servers. After we explained the issues we had to endure, the Tyan support employee provided us with a beta BIOS release for the motherboard we were using. After installing this beta BIOS and applying some custom settings we got the performance to a good level, finally!

- Mr. White, the harddisk hell:

On our database/web/etc server we endured a whole different set of issues. For some reason, harddisks kept dying on us.

After the first harddisk crash we went to the datacenter to restore the machine.  A simple reboot and disk check seemed to have solved the problems. The machine booted properly again and ran stable from what we could see.

In the weeks after, the machine started to crash more and more often, at random times and with no load at all.

To prevent greater problems in the future we ordered four 500GB Samsung SpinPoint F1 disks to be put in a RAID 1+0 config. We planned to go to the datacenter that weekend, and of course on Friday the machine crashed again, this time beyond recovery.
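
For those wondering what that config gives us: RAID 1+0 mirrors the disks in pairs and stripes across those pairs, so half of the raw capacity is usable and any single disk can die without data loss. A minimal sketch of the arithmetic in Python, assuming two-way mirrors; the function name `raid10_usable_gb` is made up for this example.

```python
def raid10_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 1+0 array built from two-way mirrors."""
    if disk_count < 4 or disk_count % 2 != 0:
        raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
    # Each mirrored pair contributes the capacity of a single disk.
    return (disk_count // 2) * disk_size_gb


print(raid10_usable_gb(4, 500))  # 1000
```

So the four 500GB disks give 1000GB of usable space out of 2000GB raw.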

To ensure we had enough time to properly test and configure the new RAID setup we temporarily replaced the machine with a spare we had. Of course, as you could guess, the HDD in this machine died on us as well.

In the end we still had to make it a quick job (although we had some 3 weeks or so to test the stuff, we were just not quite done yet). We put Mr. White back in the datacenter with VMware on it, allowing us to host a Linux install for 99.9% of our tasks and a Windows Server 2003 install for our Exchange/mail environment.

Unfortunately the new setup caused some problems for us. Whenever we did something inside the Windows OS, Linux would completely stall. After a lot of research and asking around, we found that we simply did not have enough memory in the machine. Soon after we found out, we upgraded the machine's memory from 4GB of 800MHz non-ECC modules to 8GB of 1333MHz ECC modules. This greatly improved performance, and so far we haven't run into any more issues on this machine. (We do have some upgrades in mind for the future; you'll be able to read about that tomorrow.)

Comments

Re: SpA technical '08 - '09

I liked the part about the pen the most!