Stapleton/Aspern Post-Mortem

[EN]

This is a rather unusual dev-log as it deals with recent events that go beyond the day-to-day development of the game. Therefore I chose a Q&A style. I hope this will answer most of the questions you might have about the recent problems plaguing Aspern, Stapleton and, to some extent, Quimby.

What happened?

The launch of the "Spring Cleaning Patch" (see previous dev-logs for reference) already went anything but smoothly. But after a while, we saw deteriorating performance for the Stapleton and Aspern game worlds. The background job that "runs" the simulation - since the spring cleaning I tend to call it the "pacemaker" - simply failed to keep up with the amount of work it had to do and consequently fell behind, causing a noticeable backlog. All operations such as take-offs and contract updates were delayed. This was due to the fundamental changes to the framework introduced by the spring cleaning patch.

Why did you introduce these changes in the first place?

Some people say "never touch a running system". But without "touching" a system, there won't be any progress. We decided to implement the changes to the framework for two main reasons:

  1. We wanted to get rid of the "Concurrent Access Errors" people were seeing all the time.
  2. We wanted to lay the groundwork for major future enhancements that would not be possible otherwise.

As with any modification of this scale, we anticipated a certain amount of "teething issues" - although we would have preferred fewer of them.

Why did these changes cause such problems?

The update changes the way the pacemaker does its work. Previously, most tasks were carried out in batches. Take the flight update as an example: The background job would load all pending flights - sometimes several hundred at a time - and simply "update" them in one go. In the process, it had to "touch" hundreds of financial accounts, aircraft, statistics records and so on. If a player or another process tried to write to any of these items at the same time, they would see the dreaded "Concurrent Access Error" (CAE). Also, these bulk updates, while quite optimized, are very inflexible - too inflexible for the things we have planned for the future.
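
To make the old approach a bit more concrete, here is a minimal Python sketch of such a batch-style update. All names and numbers in it are made up for illustration and have nothing to do with our actual code:

    from dataclasses import dataclass

    @dataclass
    class Flight:
        flight_id: int
        enterprise_id: int
        aircraft_id: int

    def run_bulk_flight_update(pending_flights, accounts, statistics):
        """Process all pending flights of a game world in one big batch."""
        for flight in pending_flights:
            # Every iteration writes to shared records (accounts, aircraft,
            # statistics). A player saving a change to one of these records
            # at the same moment would run into a "Concurrent Access Error".
            accounts[flight.enterprise_id] = accounts.get(flight.enterprise_id, 0) + 100
            statistics[flight.flight_id] = "departed"

    # Several hundred flights, handled in one go by a single background job.
    flights = [Flight(i, i % 10, i % 50) for i in range(300)]
    accounts, statistics = {}, {}
    run_bulk_flight_update(flights, accounts, statistics)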

The new system works differently: It loads the IDs of all affected aircraft (not flights!) and, for each of them, sends a message with that ID to a work queue for the enterprise that operates the aircraft, telling it to run the "operations" for this particular aircraft. The advantage: Things now happen in strict sequence for a single enterprise, so the risk of causing a CAE is drastically reduced. We are also able to handle more complicated logic during the update. The downside: We cause quite a bit of overhead because hundreds of aircraft and their flights have to be loaded one after the other, and a lot of the optimizations that went into the bulk updates in the past have been lost.
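
And here is an equally simplified sketch of the new, queue-based approach - again with purely hypothetical names, just to show the idea of one strictly ordered queue per enterprise:

    from collections import defaultdict
    from queue import Queue

    # One work queue per enterprise (simplified).
    enterprise_queues = defaultdict(Queue)

    def pace(aircraft_with_pending_work):
        """The 'pacemaker': post one message per affected aircraft."""
        for aircraft_id, enterprise_id in aircraft_with_pending_work:
            enterprise_queues[enterprise_id].put(("run_operations", aircraft_id))

    def work(enterprise_id):
        """A worker drains one enterprise's queue in strict order."""
        q = enterprise_queues[enterprise_id]
        while not q.empty():
            _command, aircraft_id = q.get()
            # Load this aircraft and its flights, then run the update.
            # Only one message per enterprise is handled at a time, so writes
            # to that enterprise's data rarely collide with anything else.
            print(f"running operations for aircraft {aircraft_id}")

    pace([(1, 42), (2, 42), (3, 7)])
    work(42)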

There is also a technical problem that we simply didn't take into account enough: Modern database systems are quite good at handling row-level conflicts. If two transactions want to change the same row, the database raises an error, rolls back one of the two transactions and thereby ensures a consistent state. As long as different data is accessed, many processes can work concurrently without any problems. But databases also use something called "indexes" to know where in the dataset certain records can be found. And if you want to change such an index - by adding a new record, deleting a record or changing a value that is used for indexing - transactions have to be handled in sequence again. In our case that means: Even though we are updating many aircraft in parallel, they still have to wait for each other because of index updates, oftentimes causing deadlocks which take a very long time to be resolved. Since only a limited number of queues is worked in parallel, this of course causes delays and wait times for both user and system operations.
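
A toy illustration of this effect (this is not our database code - just two dictionaries and a lock standing in for a shared index): the workers below insert different rows, so there is no row-level conflict, yet they all have to queue up behind the same index lock:

    import threading

    index_lock = threading.Lock()   # stands in for the lock on a shared index
    index = {}                      # flight number -> row id
    rows = {}

    def insert_flight(row_id, flight_number):
        rows[row_id] = flight_number     # different rows: no row-level conflict
        with index_lock:                 # but the shared index serializes everyone
            index[flight_number] = row_id

    workers = [threading.Thread(target=insert_flight, args=(i, f"XY-{i}"))
               for i in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()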

Why did it affect Aspern and Stapleton the most?

To be honest, we are not 100% sure, so here are our best guesses: The problems described above can go unnoticed if the number of operations is low and/or access to the database is very fast. Stapleton, Aspern and Quimby were all part of a trial in which we wanted to see whether we could host three game worlds on one physical machine (so far, one machine hosts only two game worlds). While Quimby is still very young and has a comparably small database, Aspern is a very active game world with almost 1000 players, while Stapleton is very mature with about 6% more flights than our average game world. Even though all our servers run on SSDs, three game worlds accessing the same disks was probably too much. Database operations were slowed down by this, amplifying the locking issues described above. Once both game worlds had built up a backlog, they were constantly working to get rid of it, causing even more traffic on the database. And so on.

This theory is backed by the fact that the backlog only stabilized after we had moved Quimby to another machine, and that Stapleton only really started to catch up after Aspern was back to normal.

How can such problems be avoided in the future?

The only way to avoid these issues is to optimize. And then optimize some more.

Now that the situation has somewhat normalized, we will discuss the events and draw our conclusions. We will need to think about the data structures we use and optimize them for the highly concurrent access the new system requires and future developments will rely on. We also need to gain more insight into the guts of our new system, doing more and broader analysis of execution and wait times to spot the bottlenecks in our code.
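
To give you an idea of the kind of measurements we mean, here is a small, purely illustrative sketch of timing a single queue message - how long it waited before being picked up and how long the actual work took. None of these names exist in our codebase:

    import time
    from dataclasses import dataclass

    @dataclass
    class TimedMessage:
        payload: str
        enqueued_at: float

    def process(message, handler):
        started = time.monotonic()
        wait_time = started - message.enqueued_at       # time spent in the queue
        handler(message.payload)
        execution_time = time.monotonic() - started     # time spent doing the work
        print(f"{message.payload}: waited {wait_time:.3f}s, ran {execution_time:.3f}s")

    msg = TimedMessage("run_operations for aircraft 4711", time.monotonic())
    time.sleep(0.05)                 # pretend the message sat in the queue a while
    process(msg, lambda payload: time.sleep(0.01))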

We have scrapped the idea of hosting more than two game worlds on one box for now, at least until we've figured out a way to make it work. 

The pending optimization work will keep us busy for quite some time to come and you can be sure to read all about it here on the dev-log. 

[DE]

The German translation will follow as soon as possible.