Over the past couple of weeks there have been sporadic outages of our account backend, which is used for authentication, among other things. When it goes down, pretty much all services that depend on authentication run into issues.
Unfortunately, when this happens at night, it takes a while until I notice. I am currently working on two things:
Resolving the root issue
Working with our ops partner to get pager duty in place for this service (technically, they are only responsible for the infrastructure, not the services that run on it)
I will keep you posted on my progress.
Sorry for the troubles and thanks a lot for your understanding!
Not much success on the root issue so far. It is likely a bug in the latest version of the web framework I updated to a few weeks back. I can’t rule out that it’s triggered by something on my end of things, but so far I’ve failed to identify the root cause. To be continued…
The service is now under our ops partner’s watchful eye as well. On top of that, I have set up better health checks and automated restarts should the service become unresponsive. Let’s see how this goes…
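For anyone curious what a health check with automated restarts can look like in principle, here is a minimal sketch in Python. It is not the actual setup running on the account backend: the `is_healthy` helper, the `Watchdog` class, the health URL, and the threshold of three consecutive failures are all assumptions made up for illustration.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the (hypothetical) health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, malformed reply, ...
        return False


class Watchdog:
    """Trigger a restart after several consecutive failed health checks."""

    def __init__(
        self,
        check: Callable[[], bool],
        restart: Callable[[], None],
        max_failures: int = 3,  # assumed threshold, tune to taste
    ) -> None:
        self.check = check
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> bool:
        """Run one check. Return True if a restart was triggered."""
        if self.check():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False
```

In practice, `tick()` would be called on a timer (say, every 30 seconds), with `check` wrapping `is_healthy` for the service's health URL and `restart` invoking whatever restarts the service. Requiring several consecutive failures before restarting avoids bouncing the service on a single slow response.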