We wanted to provide you with some additional information about the service disruption that occurred in the N. Virginia (us-east-1) Region on October 19 and 20, 2025. While the event started at 11:48 PM PDT on October 19 and ended at 2:20 PM PDT on October 20, there were three distinct periods of impact to customer applications. First, between 11:48 PM on October 19 and 2:40 AM on October 20, Amazon DynamoDB experienced increased API error rates in the N. Virginia (us-east-1) Region. Second, between 5:30 AM and 2:09 PM on October 20, Network Load Balancer (NLB) experienced increased connection errors for some load balancers in the N. Virginia (us-east-1) Region, caused by health check failures in the NLB fleet. Third, between 2:25 AM and 10:36 AM on October 20, new EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM, some newly launched instances experienced connectivity issues which were resolved by 1:50 PM.
DynamoDB
Between 11:48 PM PDT on October 19 and 2:40 AM PDT on October 20, customers experienced increased Amazon DynamoDB API error rates in the N. Virginia (us-east-1) Region. During this period, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service. The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.
Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality. Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region. Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers’ experience. This automation has been designed for resilience, allowing the service to recover from a wide variety of operational issues. In addition to providing a public regional endpoint, this automation maintains additional DNS endpoints for several dynamic DynamoDB variants, including a FIPS-compliant endpoint, an IPv6 endpoint, and account-specific endpoints. The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com), which the automation failed to repair.

To explain this event, we need to share some details about the DynamoDB DNS management architecture. The system is split across two independent components for availability reasons. The first component, the DNS Planner, monitors the health and capacity of the load balancers and periodically creates a new DNS plan for each of the service’s endpoints consisting of a set of load balancers and weights. We produce a single regional DNS plan, as this greatly simplifies capacity management and failure mitigation when capacity is shared across multiple endpoints, as is the case with the recently launched IPv6 endpoint and the public regional endpoint. A second component, the DNS Enactor, which is designed to have minimal dependencies to allow for system recovery in any scenario, enacts DNS plans by applying the required changes in the Amazon Route 53 service. For resiliency, the DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). Each of these independent instances of the DNS Enactor looks for new plans and attempts to update Route 53 by replacing the current plan with a new plan using a Route 53 transaction, ensuring that each endpoint is updated with a consistent plan even when multiple DNS Enactors attempt to update it concurrently.

The race condition involves an unlikely interaction between two of the DNS Enactors. Normally, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply it. This process typically completes rapidly and does an effective job of keeping DNS state freshly updated. Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan. As the DNS Enactor makes its way through the list of endpoints, it can encounter delays when a transaction it attempts is blocked by another DNS Enactor updating the same endpoint. In these cases, the DNS Enactor retries each endpoint until the plan is successfully applied to all endpoints. Right before this event started, one DNS Enactor experienced unusually high delays and needed to retry its updates on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans.
Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DynamoDB endpoint, overwriting the newer plan. Because of the unusually high delays in Enactor processing, the check made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was by this point stale and did not prevent the older plan from overwriting the newer one. The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactor. This situation ultimately required manual operator intervention to correct.
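To make this sequence concrete, here is a minimal sketch of the two behaviors described above: a plan-application loop that performs a one-time freshness check and then retries each endpoint, and a clean-up routine that deletes plans many generations older than the one just applied. All names, thresholds, and the DNS client and plan store interfaces are illustrative assumptions rather than the actual implementation; the sketch only shows how a sufficiently delayed Enactor can apply a stale plan while another Enactor’s clean-up deletes that same plan.

```python
# Illustrative sketch only: hypothetical names and interfaces, not the actual AWS code.
import time
from dataclasses import dataclass, field

STALE_GENERATIONS = 50  # illustrative threshold for "significantly older" plans

@dataclass
class DnsPlan:
    generation: int                              # monotonically increasing plan number
    records: dict = field(default_factory=dict)  # endpoint -> weighted load balancer IPs

class DnsEnactor:
    """One of several independent Enactors applying plans via transactional DNS updates."""

    def __init__(self, dns_client, plan_store, endpoints):
        self.dns_client = dns_client   # assumed transactional update client (Route 53 stand-in)
        self.plan_store = plan_store   # assumed shared store of generated plans
        self.endpoints = endpoints
        self.last_applied_generation = 0

    def apply_plan(self, plan: DnsPlan) -> None:
        # One-time freshness check, made only before the run begins. If the
        # Enactor is then delayed for a long time retrying endpoints, this
        # check becomes stale -- the gap the race condition exploited.
        if plan.generation <= self.last_applied_generation:
            return
        for endpoint in self.endpoints:
            # Retry until the transactional update succeeds, for example when
            # another Enactor is concurrently updating the same endpoint.
            while not self.dns_client.apply_transaction(endpoint, plan):
                time.sleep(1)  # delays here can accumulate
        self.last_applied_generation = plan.generation

    def clean_up_old_plans(self, applied_generation: int) -> None:
        # Delete plans many generations older than the one just applied. The
        # flaw mirrored here: there is no check that an "old" plan is not, at
        # this moment, the plan actually serving an endpoint (because a delayed
        # Enactor just wrote it), so deleting it empties that endpoint's record.
        for plan in list(self.plan_store.all_plans()):
            if applied_generation - plan.generation > STALE_GENERATIONS:
                self.plan_store.delete(plan)
```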
When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB. Customers with DynamoDB global tables were able to successfully connect to and issue requests against their replica tables in other Regions, but experienced prolonged replication lag to and from the replica tables in the N. Virginia (us-east-1) Region. Engineering teams for impacted AWS services were immediately engaged and began to investigate. By 12:38 AM on October 20, our engineers had identified DynamoDB’s DNS state as the source of the outage. By 1:15 AM, temporary mitigations had enabled some internal services to connect to DynamoDB and had repaired key internal tooling, which unblocked further recovery. By 2:25 AM, all DNS information was restored, and all global table replicas were fully caught up by 2:32 AM. Customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired between 2:25 AM and 2:40 AM. This completed recovery from the primary service disruption event.
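From a client’s perspective, the recovery looked like ordinary DNS behavior: resolution of the regional endpoint failed while its record was empty (or while a resolver still cached that failure) and began succeeding again as the restored records propagated and cached entries aged out. The following is a rough, generic sketch of that resolve-and-retry pattern; it assumes nothing about any particular AWS SDK, and the timeouts and retry counts are illustrative.

```python
# Generic resolve-and-retry sketch; timeouts and retry counts are illustrative.
import socket
import time

def connect_when_dns_recovers(host="dynamodb.us-east-1.amazonaws.com", port=443,
                              max_attempts=10, backoff_seconds=30):
    for attempt in range(max_attempts):
        try:
            # getaddrinfo raises socket.gaierror while the name cannot be
            # resolved (for example, an empty record or a cached failure).
            socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            return socket.create_connection((host, port), timeout=5)
        except OSError:  # covers gaierror and connection errors
            time.sleep(backoff_seconds)  # wait for cached DNS entries to expire
    raise RuntimeError(f"could not resolve or connect to {host}")
```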
Amazon EC2
Between 11:48 PM PDT on October 19 and 1:50 PM PDT on October 20, customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region. EC2 instances launched prior to the start of the event remained healthy and were not impacted for the duration of the event. After the DynamoDB DNS issue was resolved at 2:25 AM PDT, customers continued to see increased errors for new instance launches. Recovery started at 12:01 PM PDT, with full EC2 recovery occurring at 1:50 PM PDT. During this period, new instance launches failed with either a “request limit exceeded” or an “insufficient capacity” error.
To understand what happened, we need to share some information about a few subsystems that are used for the management of EC2 instance launches, as well as for configuring network connectivity for newly launched EC2 instances. The first subsystem is DropletWorkflow Manager (DWFM), which is responsible for the management of all the underlying physical servers that are used by EC2 for the hosting of EC2 instances – we call these servers “droplets”. The second subsystem is Network Manager, which is responsible for the management and propagation of network state to all EC2 instances and network appliances. Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems. As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.
Starting at 11:48 PM PDT on October 19, these DWFM state checks began to fail because the process depends on DynamoDB and was unable to complete. While this did not affect any running EC2 instance, it did mean that a droplet whose lease lapsed needed to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it was hosting. Between 11:48 PM on October 19 and 2:24 AM on October 20, leases between DWFM and droplets within the EC2 fleet slowly started to time out.
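As an illustrative model (not EC2’s actual implementation) of the lease mechanics described above, the sketch below shows state checks that record droplet state in a backing store standing in for the DynamoDB dependency: a successful check renews the lease, a failed check leaves the lease to expire, and only droplets with an active lease are considered candidates for new launches.

```python
# Illustrative model of droplet-lease bookkeeping; names and timings are assumptions.
import time

LEASE_TIMEOUT_SECONDS = 300  # leases renewed by state checks every few minutes

class DropletLeaseManager:
    def __init__(self, state_store):
        self.state_store = state_store   # assumed dependency backed by DynamoDB
        self.lease_expiry = {}           # droplet_id -> lease expiry timestamp

    def run_state_checks(self, droplet_ids):
        now = time.time()
        for droplet_id in droplet_ids:
            try:
                # Records the droplet's state; during the event this call could
                # not complete because DynamoDB was unreachable.
                self.state_store.record_state_check(droplet_id)
                self.lease_expiry[droplet_id] = now + LEASE_TIMEOUT_SECONDS
            except ConnectionError:
                pass  # lease is not renewed and will eventually time out

    def launch_candidates(self, droplet_ids):
        now = time.time()
        # Droplets without an active lease are not candidates for new launches,
        # which surfaces as "insufficient capacity" at the EC2 API.
        return [d for d in droplet_ids if self.lease_expiry.get(d, 0) > now]
```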
At 2:25 AM PDT, with the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet. Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning “insufficient capacity” errors for new incoming EC2 launch requests. However, due to the large number of droplets, attempts to establish new droplet leases took long enough that the work could not be completed before it timed out, and additional work was queued to reattempt establishing each droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases. Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further problems. After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation. Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established. By 5:28 AM, DWFM had established leases with all droplets within the N. Virginia (us-east-1) Region and new launches were once again starting to succeed, although many requests were still seeing “request limit exceeded” errors due to the request throttling that had been introduced to reduce overall request load.
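The congestive-collapse dynamic can be illustrated with a simplified queueing sketch using made-up numbers: when each lease re-establishment attempt times out before completing and is re-queued, the backlog never shrinks, while a throttled backlog lets individual attempts finish within their deadline and make steady progress. This is only a toy model of the behavior described above, not a description of DWFM internals.

```python
# Toy model of congestive collapse during lease re-establishment; all numbers are made up.
from collections import deque

def simulate_recovery(pending_droplets, attempt_timeout, base_work_time, per_item_overhead):
    """Returns the number of leases established; stops early if no progress is being made."""
    queue = deque(range(pending_droplets))
    established = 0
    while queue:
        progressed = False
        for _ in range(len(queue)):          # one pass over the current backlog
            droplet = queue.popleft()
            # Effective time per attempt grows with the backlog (queueing delay).
            attempt_time = base_work_time + per_item_overhead * len(queue)
            if attempt_time > attempt_timeout:
                queue.append(droplet)        # timed out before completing; re-queued
            else:
                established += 1
                progressed = True
        if not progressed:
            return established               # backlog is not shrinking: congestive collapse
    return established

# A large unthrottled backlog makes no progress; throttling the same work lets it complete.
print(simulate_recovery(pending_droplets=10_000, attempt_timeout=5.0,
                        base_work_time=1.0, per_item_overhead=0.01))   # -> 0
print(simulate_recovery(pending_droplets=300, attempt_timeout=5.0,
                        base_work_time=1.0, per_item_overhead=0.01))   # -> 300
```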