Postmortem of yesterday's downtime

Yesterday we had a bad outage. From 22:25 to 22:58 most of our servers were down and requests were answered with 503 errors. As is common in these scenarios, the cause was a cascade of failures, which we describe in detail below.

Every day we serve millions of API requests, and thousands of businesses depend on us. We deeply regret downtime of any nature, but it's also an opportunity for us to learn and make sure it doesn't happen again.

Below is yesterday's postmortem. We've taken several steps to remove single points of failure and to make sure this kind of scenario does not repeat.


Timeline

While investigating high CPU usage on a number of our CoreOS machines, we found that systemd-journald was consuming a lot of CPU.

Research led us to https://github.com/coreos/bugs/issues/1162, which included a suggested fix. We tested the fix on one machine and confirmed that systemd-journald CPU usage had dropped significantly. We then tested it on two other machines a few minutes apart. CPU use dropped on both, with no signs of service interruption.

Satisfied that the fix was safe, we rolled it out to all of our machines, one after another in quick succession. At this point there was a flood of pages as most of our infrastructure began to fail. Restarting systemd-journald had caused docker to restart on each machine, killing all running containers. Because the fix ran on every machine within a short window, all of our fleet units went down at roughly the same time, including some that we rely on as part of our service discovery architecture. Several other compounding issues meant that our architecture was unable to heal itself. Once key pieces of our infrastructure were brought back up manually, the services were able to recover.

Causes

The start of the cascading failure was restarting systemd-journald on all machines in quick succession. If we had noticed that it caused a restart of dockerd, we wouldn’t have applied this fix on live machines. Instead, we would have adjusted our machine building configuration (we use terraform) to build new machines with the fix in place, and slowly replaced the old machines.
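The signal we missed was a dependent service restarting as a side effect of the fix. As a minimal sketch (not what we actually ran), a staged rollout could check for that side effect after each machine and stop at the first surprise. The helper names apply_fix and service_started_at are hypothetical hooks that you would define for your own environment.

```shell
#!/bin/sh
# Staged rollout sketch: apply a fix to one host at a time and halt as soon
# as a dependent service appears to have restarted on that host.
#
# Assumptions (hypothetical hooks, not part of any real tool):
#   apply_fix HOST           applies the change to HOST
#   service_started_at HOST  prints a value that changes whenever the
#                            dependent service (e.g. docker) restarts

staged_rollout() {
    for host in "$@"; do
        # Record the service start marker before touching the host.
        before=$(service_started_at "$host") || return 1

        if ! apply_fix "$host"; then
            echo "fix failed on $host; halting rollout" >&2
            return 1
        fi

        # If the marker changed, the fix restarted the dependent service.
        after=$(service_started_at "$host") || return 1
        if [ "$before" != "$after" ]; then
            echo "dependent service restarted on $host; halting rollout" >&2
            return 1
        fi

        echo "ok: $host"
    done
}
```

On systemd hosts, one possible marker is the unit's ActiveEnterTimestamp property, read with something like systemctl show docker.service --property=ActiveEnterTimestamp. A restart that happens with a delay would still slip past a check like this, so a pause between hosts is probably worth adding too.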

Mass failure of fleet units makes it very likely that units will be relocated to machines they weren’t previously on. For a unit to start on a new machine, the docker image must already be present on that machine, or docker must be able to pull it from our private docker registry. Unfortunately we run our docker registry inside a docker container, and it wasn’t able to come up cleanly (for reasons explained below). This meant that a large number of migrated units weren’t able to pull the images they needed, causing them to fail.

We are trialling consul as a discovery mechanism and key/value store. We run our consul servers and agents inside docker containers. The containers are run using something like: docker run --rm --name consul … which means that docker should create a container named consul and delete it when it stops. Unfortunately dockerd doesn’t always clean up the container properly. When systemd tried to restart our consul servers, docker complained that a container named consul already existed and refused to run.
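A minimal sketch of how a stale container can be cleared before each start is shown below. The function name run_fresh is hypothetical, and the arguments after the container name stand in for whatever image and options the real command uses.

```shell
#!/bin/sh
# Start a named container, first removing any leftover container that
# still holds the same name. run_fresh is a hypothetical helper name.
# Usage: run_fresh NAME [remaining docker run arguments...]

run_fresh() {
    name=$1
    shift

    # -f removes the container even if it is still running. A failure here
    # (for example, no such container) is expected and deliberately ignored.
    docker rm -f "$name" >/dev/null 2>&1 || true

    # The name is now free, so this no longer fails with a name conflict.
    docker run --rm --name "$name" "$@"
}
```

In a systemd unit the same idea is usually expressed as an ExecStartPre line prefixed with "-" so that a failed removal does not stop the unit from starting.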

To test out consul and get a feel for it we had moved some of our docker registry configuration into it. As consul was down, the docker registry was unable to fetch some of its configuration and so was unable to start.

We were using docker containers to run our consul servers and hadn’t added a volume for the data consul saves to disk. This meant that when all of our consul servers were down at the same time, we lost the data stored in the key/value store (thankfully nothing we didn’t have other copies of).
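For illustration only, this is roughly what keeping the data on the host would have looked like while consul was still in docker. The function name and both directory arguments are hypothetical; the in-container data path depends on the image and on how consul is configured.

```shell
#!/bin/sh
# Run the consul container with its data directory bind-mounted from the
# host, so the data outlives the container even with --rm.
# run_consul_with_data is a hypothetical helper name.
# Usage: run_consul_with_data HOST_DIR CONTAINER_DIR [docker run arguments...]

run_consul_with_data() {
    host_dir=$1
    container_dir=$2
    shift 2

    # Make sure the host directory exists before docker mounts it.
    mkdir -p "$host_dir"

    # A bind mount lives on the host filesystem, so it is not deleted when
    # the container is removed.
    docker run --rm --name consul -v "$host_dir:$container_dir" "$@"
}
```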

Improvements

1) Try to avoid running any fix on live machines unless it’s an emergency. Prefer adjusting the build configuration and letting the improved machines slowly replace the old ones.

2) Don’t run our own private docker registry; use a third-party one. We’ve had a few issues with the docker registry being unavailable, and it cripples our ability to deploy or to recover easily from failures. We’re looking into third-party docker registries, although none stands out as a clear winner yet.

3) Ensure that we forcibly remove old containers before trying to start new ones.

4) Run consul servers with a persistent data store, and don’t run them via fleet or docker. We have now changed our architecture so that consul no longer relies on fleet or docker, improving our stability and our ability to react when issues arise. The fact that consul is written in golang is a big plus here, as it means we can easily install the binary onto the CoreOS machine without needing to use docker or install a large number of dependencies. With this change we have moved towards an architecture where infrastructure-related services are all run directly by systemd without docker, and our application tier is run via fleet using docker containers.