The massive Amazon Web Services outage that took down sites from Reddit to Ring to Roblox has been fixed, the company said. The AWS outage left huge portions of the internet unavailable to many people for most of the workday on Monday. As the day wore on, the breakdown affected more than 2,000 companies and services, including Snapchat, Fortnite, Venmo, the PlayStation Network, Amazon itself and critical services such as online banking.
As of 3:53 p.m. PT, Amazon said the issue was resolved. The company said the outage began at 11:49 p.m. PT on Sunday, when it saw increased error rates for services in its US-East-1 Region on the US East Coast. Amazon says its workers identified the source of the errors at 12:26 a.m., attributing them to DNS resolution issues affecting the regional DynamoDB service endpoints. After that issue was resolved, Amazon faced additional problems and had to throttle certain operations, meaning it temporarily limited the rate at which those operations could run.
"Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered," the latest update said. "By 3:01 p.m., all AWS services returned to normal operations."
Why were so many sites affected?
AWS, a cloud services provider owned by Amazon, props up huge portions of the internet. So when it went down, it took many of the services we know and love with it. As with the Fastly and CrowdStrike outages over the past few years, the AWS outage shows just how much of the internet relies on the same infrastructure, and how quickly our access to the sites and services we rely on can be revoked when something goes wrong.
The reliance on a small number of big companies to underpin the web is akin to putting all of our eggs in a tiny handful of baskets. When it works, it's great, but only one small thing needs to go wrong for the internet to fall to its knees in a matter of minutes.
Outage reports spiked as the West Coast woke up
AWS first registered an issue on its service status page just after midnight PT on Monday, saying it was "investigating increased error rates and latencies for multiple AWS services in the US-East-1 Region." Around 2 a.m. PT, it said it had identified a potential root cause of the issue. Within half an hour, it had begun applying mitigations and was seeing significant signs of recovery.
"The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now," AWS said at 3:35 a.m. PT.
The issues seemed to have been largely resolved as people on the US East Coast were starting their day, but outage reports spiked again dramatically after 8 a.m. PT as work began on the West Coast.