Finally, after five weeks (8, 9, 10, 11, 12) of framework upgrades and bug hunts, I found the time to work on actual feature development. Or so I thought…
My backend framework upgrade from week 8 didn’t let me off the hook quite that easily: a pesky out-of-memory error kept striking sporadically (and in the middle of the night), rendering our account management backend inaccessible and thereby leaving all of our game services semi-broken. This left me no choice but to troubleshoot and fix the issue. And if the latter didn’t work, to find countermeasures to prevent another downtime.
There’s a brief update about what I did in the linked incident report, but here are the steps in a bit more detail:
- Adjusted the configuration of the service to write so-called heap dumps whenever an out-of-memory error occurs.
- Waited for one of these to be written.
- Analysed the heap dump and found that the root cause most likely lies somewhere in 3rd-party code.
- Spent several hours trying to figure out which code on my end, if any, might be triggering the error in the 3rd-party code.
- Rewrote one suspicious implementation of a deprecated interface (which might have fixed the issue…or not…no further crash so far) but gave up on finding the root cause after that.
- Added a health check to the service’s container to mark it as “unhealthy” once it stops serving requests.
- Added an “auto-healer” container that restarts the service if it goes unhealthy.
- Got our ops-partner to add the service to their monitoring, so there’s a higher chance of a human with admin access noticing that the service is down.
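For the heap-dump step, assuming the service runs on a JVM (the post doesn’t name the runtime), the standard HotSpot flags look roughly like this — the service name and path are placeholders, not my actual setup:

```shell
# Hypothetical launch command; the two -XX flags are standard HotSpot options.
# -XX:+HeapDumpOnOutOfMemoryError: write a heap dump whenever an OutOfMemoryError is thrown
# -XX:HeapDumpPath: directory (or file path) the .hprof dump is written to
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/dumps \
     -jar account-service.jar
```

The resulting `.hprof` file is what you then load into a heap analyser to hunt for the leak.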
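The health-check and “auto-healer” combination can be sketched as a Docker Compose fragment. This is an illustration, not my actual config: the service name, port, and `/health` endpoint are placeholders, and I’m using the off-the-shelf `willfarrell/autoheal` image as one common way to get the restart-on-unhealthy behaviour:

```yaml
# Hypothetical docker-compose sketch; names, port, and endpoint are placeholders.
services:
  account-service:
    image: account-service:latest
    labels:
      - autoheal=true                # opt this container in to auto-restarts
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s                  # probe every 30 seconds
      timeout: 5s
      retries: 3                     # mark "unhealthy" after 3 consecutive failures

  autoheal:
    image: willfarrell/autoheal      # watches for unhealthy containers and restarts them
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # needed so it can restart containers
```

The point of the split: the health check only *marks* the container as unhealthy; a separate container with access to the Docker socket has to do the actual restarting.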
As said, the service has been behaving well since I made these changes. But the problem only occurs sporadically, so it’ll take a few days until I’m somewhat convinced that the problem is gone and/or that my countermeasures work sufficiently well.
Oh, and that UI design work…I didn’t spend as much time on it as I was planning to. But I am very hopeful that I’ll actually make some progress on that front in week 14…