What really happened to Amazon? Anatomy of a DNS failure in us-east-1 that caused a cascading outage.
Well, let's ride the hype around someone else's trouble a little. The failure itself isn't exotic; plenty of people have lived through something similar. So what exactly just happened at Amazon?
It may sound like a joke, but once again it was DNS. The details are in the post-mortem, but that's a lot of fine print, so here's the TL;DR.
DynamoDB's DNS records are managed by an entire ecosystem of their own. Load balancers are constantly being added and removed to handle the load. A DNS Planner watches them and maintains a plan (essentially the current list of balancers), and these plans are picked up by so-called DNS Enactors, which push them into Route 53. There is one DNS Enactor per availability zone, three in total, and between them there was a race condition.
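Conceptually, an Enactor's job boils down to writing the plan's current set of balancer IPs into Route 53. Here is a rough sketch of a single record update via boto3, just to make the moving parts concrete; the zone ID, record name, and addresses are placeholders, and the real Enactor obviously does far more than one UPSERT:

```python
import boto3

route53 = boto3.client("route53")

def apply_plan(hosted_zone_id: str, record_name: str, ips: list[str]) -> None:
    """Upsert an A record so it points at the load balancers listed in the plan."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Apply current DNS plan",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip} for ip in ips],
                },
            }],
        },
    )

# Hypothetical usage: the zone ID and IPs would come from the plan, not real AWS data.
apply_plan("Z0000000EXAMPLE", "dynamodb.us-east-1.amazonaws.com.", ["192.0.2.10", "192.0.2.11"])
```

Note that an UPSERT replaces the whole record set, which is exactly why a stale plan applied late can silently drop addresses a newer plan had added.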
One of them started running unusually slowly, while another raced ahead and applied newer plans. The delayed one then overwrote the fresh state with its stale plan, that stale plan was subsequently cleaned up as obsolete, and dynamodb.us-east-1.amazonaws.com was left with no IP addresses at all.
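In other words, a classic last-writer-wins race. Below is a toy model of the failure mode and of the sort of guard that closes it; the Plan/DnsStore classes are illustrative, not AWS's actual internals. If applies are ordered only by whoever writes last, a delayed worker clobbers the newer plan, a cleanup pass then purges the "obsolete" plan, and the record ends up empty; refusing to apply anything older than what is already there prevents the rollback.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    generation: int        # monotonically increasing plan version
    ips: list[str]         # load-balancer IPs for the endpoint

class DnsStore:
    """Toy stand-in for the Route 53 record of dynamodb.us-east-1.amazonaws.com."""
    def __init__(self) -> None:
        self.applied_generation = -1
        self.ips: list[str] = []

    def apply(self, plan: Plan, *, check_generation: bool) -> None:
        # The guard: refuse to move backwards in plan generations.
        if check_generation and plan.generation <= self.applied_generation:
            return
        self.applied_generation = plan.generation
        self.ips = list(plan.ips)

    def cleanup(self, newest_known_generation: int) -> None:
        # Purge data belonging to plans considered obsolete.
        if self.applied_generation < newest_known_generation:
            self.ips = []                     # the record is emptied

fast_plan = Plan(generation=7, ips=["192.0.2.20"])
stale_plan = Plan(generation=3, ips=["192.0.2.10"])

store = DnsStore()
store.apply(fast_plan, check_generation=False)
store.apply(stale_plan, check_generation=False)   # delayed enactor overwrites newer state
store.cleanup(newest_known_generation=7)          # "old" plan purged
print(store.ips)                                  # [] -- the endpoint has no IPs left

store = DnsStore()
store.apply(fast_plan, check_generation=True)
store.apply(stale_plan, check_generation=True)    # rejected as stale
print(store.ips)                                  # ['192.0.2.20']
```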
DynamoDB promptly went dark, taking down most of the services that rely on it. Restoring it took about three hours. And that was only the beginning.
While DynamoDB was down, the EC2 scheduling subsystem concluded that most of the droplets (and yes, droplets exist outside DigitalOcean: internally that's what AWS calls the physical hosts EC2 instances run on) were unavailable, because their "busy / free to hand out" status lives in DynamoDB and had gone stale during the outage. When DynamoDB came back, so many hosts needed to be flipped back to "available" at once that the requests timed out, and new instances could not be handed out.
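The shape of that problem is easy to reproduce in miniature: per-host state lives in an external store and is kept alive by periodic refreshes, the store goes away, every lease lapses at roughly the same time, and recovery then has to re-establish all of them at once. A toy sketch under those assumptions; the names, numbers, and lease mechanics are invented, not EC2's real machinery:

```python
LEASE_TTL = 3          # "ticks" a host stays available without a successful refresh

class HostState:
    """Toy per-host record; in the real system this state lives in DynamoDB."""
    def __init__(self, host_id: str) -> None:
        self.host_id = host_id
        self.lease_expires_at = LEASE_TTL

    def refresh(self, now: int, store_up: bool) -> None:
        # A refresh only succeeds while the state store is reachable.
        if store_up:
            self.lease_expires_at = now + LEASE_TTL

    def available(self, now: int) -> bool:
        return now < self.lease_expires_at

hosts = [HostState(f"host-{i}") for i in range(10_000)]

# Ticks 1..10: the store is down, every refresh fails, and all leases lapse together.
for tick in range(1, 11):
    for h in hosts:
        h.refresh(tick, store_up=False)

# When the store comes back, the entire fleet needs re-registering at once --
# that flood is what timed out in the real incident.
backlog = [h for h in hosts if not h.available(now=10)]
print(len(backlog))   # 10000
```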
AWS engineers got out of it by throttling the requests, and roughly 12 hours after the incident began things were working again.
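Throttling here simply means draining that backlog at a rate the control plane can actually absorb instead of hammering it with everything at once. A minimal token-bucket sketch of the idea; the rates, names, and the drain_backlog helper are made up for illustration, not AWS's code:

```python
import time

class TokenBucket:
    """Allow at most `rate` operations per second, with a small burst allowance."""
    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)   # wait until a token accrues

def drain_backlog(backlog, mark_available, rate_per_sec: float = 200.0) -> None:
    """Re-mark hosts 'available' no faster than the control plane can absorb."""
    bucket = TokenBucket(rate=rate_per_sec, burst=50)
    for host in backlog:
        bucket.acquire()
        mark_available(host)

# Hypothetical usage with the backlog from the previous sketch:
# drain_backlog(backlog, mark_available=lambda h: print("recovered", h.host_id))
```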
Finally, there was some fun on the networking side: the Network Manager was swamped by the backlog of network-state updates that had piled up in the meantime. Working through it took another five hours, plus three more to lift the last of the restrictions.
The situation was made worse by the fact that us-east-1 is a special region: it hosts critical services that the rest of the AWS infrastructure depends on, including the sign-in page for the console, so problems showed up in other regions as well.
Underneath all of this is a genuinely interesting story of internal interactions and decisions, and the conclusions are worth drawing: a single race condition in a DNS control plane, amplified by recovery storms in the systems downstream, was enough to degrade a region for the better part of a day.
