Yesterday I happened to be involved in an incident that caused an outage to a new system deployed to AWS as Docker containers. We are already heavily invested in AWS, and chose Elastic Beanstalk (PaaS) for its simplicity and speed, and because support for multi-container Docker is now available in Sydney. Whilst our experience with Beanstalk has been very positive to date, this particular issue was hard to track down and fully understand. However, as with many bugs, it has a very simple solution that is worth sharing.
Consider the following setup:
- A Beanstalk application with 2 environments (we are using the excellent EB Deployer) – 1 active and 1 inactive
- Each beanstalk environment automatically comes with an Auto-scaling group (ASG), a load balancer (ELB) and let’s say, in our case, 2 instances
- It also comes with a DNS record we use to redirect traffic between these environments during deployments, à la Blue/Green
- Each instance will have 1 or more Docker containers with our applications installed on it
- We configure our main Docker container to have a health endpoint at /health that the ELB will use to take an instance in/out of service
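For illustration, here is a minimal sketch of what such a `/health` endpoint could look like, using only the Python standard library. The actual service's implementation and checks are not described in this post, so `check_dependencies()` is a placeholder assumption:

```python
# Hypothetical sketch of the /health endpoint the ELB polls; the real
# service's checks are unknown, so check_dependencies() is a placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies():
    # Placeholder assumption: a real service would verify databases,
    # queues, etc. and return False when something critical is broken.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and check_dependencies():
            self.send_response(200)   # 200 keeps the instance in service
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(503)   # non-200: ELB takes it out of service
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in this sketch
```

Serving it is then a matter of `HTTPServer(("", 8080), HealthHandler).serve_forever()`. The important detail for this incident: because the endpoint lives inside the main Docker container, it dies along with the Docker daemon, and the ELB check fails.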
Symptoms / Investigation
In our case, we had monitoring configured at the service level as well as the instance level – if our service stopped performing its intended behaviour, or our EC2 instances went red, we alerted. We had tested killing machines and other standard auto-scaling events. What we noticed was that, at an obscure hour of the morning, our Beanstalk application went red. In our analysis, we found that:
- Within our instances, the Docker daemon progressively shut down across instances. There didn’t appear to be any trigger / cause for this and we’ve yet to explain why it died
- This of course failed our ELB health checks, as the service was no longer available on those instances – it did not, however, break the actual service from the outside world at first, so alerting never went off
- The ELB correctly marked the instances unhealthy and removed them from the pool
- Auto-scaling, however, did nothing
- Eventually our Beanstalk application failed completely, at which point a human was involved and attempted a rebuild (a Beanstalk operation) of the environment
- What we noticed was that upon a rebuild of the environment, the latest application version (deployed to inactive) – not the version associated with the current environment – was released, which was not what we wanted at all
So there were two issues:
- Why didn’t the application automatically heal itself when it knew things were broken?
- Why does a rebuild use a different version?
Explanation of Events
In short, by default a Beanstalk setup creates an Auto-scaling group (ASG), a load balancer and several EC2 instances. The load balancer was set to monitor the instances via an HTTP health check at /health, where anything other than a 200 response results in the instance being removed from service. However, the auto-scaling group – the component responsible for scaling up/down, but also for terminating and replacing unhealthy instances – only checks the status of the underlying EC2 instances, not the applications running on them. The result was this:
- The ELB eventually removed both unhealthy instances, as the docker processes on each instance had died
- At this point, the ELB had no instances to serve traffic to
- The ASG believed, however, it had 2 healthy instances as the machines that hosted the docker processes were perfectly fine
- Beanstalk was red, but didn’t know what to do to resolve itself
- The actual service was also, obviously, down at this point
- When we attempted a rebuild of the environment, Beanstalk didn't take the version pinned to the previously green environment (now red) – it took the last attempted build, which was a failure – thus compounding the issue further
As one user points out, the Beanstalk documentation is either not up to date or very misleading on this subject – it certainly leads you to believe that the health check you set up is used by the ASG, not only the ELB, and there are no options in Beanstalk to change it. There is, however, a way to configure the application to use the ELB health check process to determine when to terminate and replace unhealthy instances (see here). In a lab environment, we were able to reproduce the above scenario completely, validate this hypothesis, and then test out the new configuration to understand its behaviour. Under the new setup, a failure at any level – load balancer health check, instance health check or application level – results in the underlying EC2 instance being terminated and rebuilt per the ASG guidelines, without downtime. As for the second issue, we need to do a bit more research to fully understand this behaviour, or whether some other user error was the root cause.
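The configuration change amounts to overriding the ASG's health check type so it trusts the ELB check. A sketch of the `.ebextensions` snippet, assuming the default `AWSEBAutoScalingGroup` logical resource name and a 300-second grace period (the file name and grace period are our choices, not prescribed values):

```yaml
# .ebextensions/autoscaling-elb-healthcheck.config
# Tell the environment's Auto Scaling group to use the ELB health check
# (HTTP /health) rather than only the EC2 instance status checks.
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: "300"
```

With this in place, an instance whose application stops answering /health is terminated and replaced by the ASG, rather than sitting "healthy" with no traffic routed to it.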
Lessons Learned
- PaaS offerings are great for getting moving quickly, and make things really easy for those without a detailed understanding of the underlying platform. However, it is a double-edged sword: it makes things hard when it comes time to diagnose an issue, and reproducing the platform locally is next to impossible
- Monitor & measure all the things – for us, even though we were able to recover without impacting customers, having tools like CloudWatch, collectd, Graphite and a number of other monitoring tools at our disposal gave us far more power to diagnose the situation
- Ship logs – this saved us!
- Test every obscure scenario you can think of – kill Docker daemons, processes and so on, and test whether the system self-heals. Infrastructure is ephemeral; we can always rebuild it and should always be able to do so
- Ensure that the ‘inactive’ environment is always ready on stand-by. This way, if the ‘active’ environment breaks it is a quick switch to a running ‘inactive’ environment. Think of it as a cold standby
- Consider tearing down environments and recreating fresh underlying EC2 instances on each deployment – whilst it does slow things down a bit, it will prevent any sort of configuration drift
- Always ensure the latest application version deployed to Beanstalk is green, so that any redeploys are guaranteed to work and to be the version you expect