Monday's Massive AWS Outage Explained: Looks Like It's Finally Over

The massive Amazon Web Services outage that took down sites from Reddit to Ring to Roblox has been fixed, the company said. The AWS outage rendered huge portions of the internet unavailable for most of the workday for many people on Monday. As the day rolled along, the breakdown affected more than 2,000 companies and services, including Snapchat, Fortnite, Venmo, the PlayStation Network, Amazon itself and critical services such as online banking.

As of 3:53 p.m. PT, Amazon said that the massive issue was resolved. The company said the outage began at 11:49 p.m. on Sunday, when it saw increased error rates for services on the US East Coast. Amazon says its workers identified the source of the error at 12:26 a.m., blaming DNS resolution issues for the regional DynamoDB service endpoints.

After that issue was resolved, Amazon faced additional problems and had to throttle certain operations, meaning it temporarily limited their power and performance. "Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered," the latest update said. "By 3:01 p.m., all AWS services returned to normal operations."

Why were so many sites affected?

AWS, a cloud services provider owned by Amazon, props up huge portions of the internet. So when it went down, it took many of the services we know and love with it. As with the Fastly and CrowdStrike outages over the past few years, the AWS outage shows just how much of the internet relies on the same infrastructure -- and how quickly our access to the sites and services we rely on can be revoked when something goes wrong.

The reliance on a small number of big companies to underpin the web is akin to putting all of our eggs in a tiny handful of baskets. When it works, it's great, but only one small thing needs to go wrong for the internet to fall to its knees in a matter of minutes.
Outage reports spiked as the West Coast woke up

AWS first registered an issue on its service status page just after midnight PT on Monday, saying it was "investigating increased error rates and latencies for multiple AWS services in the US-East-1 Region." Around 2 a.m. PT, it said it had identified a potential root cause of the issue. Within half an hour, it had started applying mitigations that were resulting in significant signs of recovery.

"The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now," AWS said at 3:35 a.m. PT.

The issues seemed to have been largely resolved as the US East Coast was coming online, but outage reports spiked again dramatically after 8 a.m. PT as work began on the West Coast.

As of 8:43 a.m. PT, the AWS status page showed the severity as "degraded." In a post at that time, AWS noted: "We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations." (EC2 is AWS shorthand for Amazon Elastic Compute Cloud, a service that it says "provides secure, resizable compute capacity in the cloud.")

Amazon didn't respond to a request for further comment beyond pointing us back to the AWS health dashboard.

Image: The AWS outage first peaked before dawn Monday in the US, then subsided, and surged again around midday. Credit: Downdetector/Screenshot by CNET

Around the time that AWS says it first began noticing error rates, the outage-tracking site Downdetector saw reports begin to spike across many online services, including banks, airlines and phone carriers. As AWS resolved the issue, reports for some of these services dropped off, whereas others have yet to return to normal. (Downdetector is owned by the same parent company as CNET, Ziff Davis.)

Around 4 a.m. PT, Reddit was still down, while services including Ring, Verizon and YouTube were still seeing a significant number of reported issues. Reddit finally came back online around 4:30 a.m. PT, according to its status page, which was then verified by CNET.

In total, Downdetector saw over 9.8 million reports, with 2.7 million coming from the US, over 1.1 million from the UK and the rest largely spread across Australia, Japan, the Netherlands, Germany and France. Over 2,000 companies in total have been affected, Downdetector added, with around 280 still experiencing issues around 10 a.m. PT.

"This kind of outage, where a foundational internet service brings down a large swath of online services, only happens a handful of times in a year," Daniel Ramirez, Downdetector by Ookla's director of product, told CNET. "They probably are becoming slightly more frequent as companies are encouraged to completely rely on cloud services and their data architectures are designed to make the most out of a particular cloud platform."

What caused the AWS outage?

AWS didn't immediately share full details about what caused the internet to fall off a cliff this morning. Then at 8:43 a.m. PT, it offered this brief description: "The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers." Earlier in the day it had attributed the outage to a "DNS issue."

Image: The internet came to its knees with many sites reporting outages early Monday, according to Downdetector. Credit: Downdetector/Screenshot by CNET

DNS stands for the domain name system and refers to the service that translates human-readable internet addresses (for example, CNET.com) into machine-readable IP addresses that connect browsers with websites. When a DNS error occurs, the translation process cannot take place, interrupting the connection. DNS errors are common internet roadblocks, but usually happen on a small scale, affecting individual sites or services. Because the use of AWS is so widespread, a DNS error can have equally widespread results.
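For readers who want to see that translation step in practice, here is a minimal Python sketch of a DNS lookup and of what a failed lookup looks like to a program. It is an illustration only, not a reconstruction of what happened inside AWS: the `resolve` helper and its `lookup` parameter are hypothetical names chosen for this example, and the only real library call is Python's standard `socket.getaddrinfo`.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, unique IP addresses for hostname, or [] if DNS fails.

    `lookup` is a hypothetical hook for this sketch; by default it is the
    standard library's socket.getaddrinfo, which asks the system resolver.
    """
    try:
        records = lookup(hostname, None)
    except socket.gaierror:
        # The name could not be translated into an address, so a client
        # has nothing to connect to, even if the server itself is healthy.
        return []
    # Each record is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({record[4][0] for record in records})


if __name__ == "__main__":
    # Needs network access; the addresses printed will vary.
    print(resolve("cnet.com"))
```

An empty list here corresponds to the situation described above: the service may still be running, but software that cannot translate its name has no address to connect to.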
According to Amazon, the issue is geographically rooted in its US-East-1 region, which refers to an area of northern Virginia where many of its data centers are based. It's a significant location for Amazon, as well as many other internet companies, and it props up services spanning the US and Europe.

"The lesson here is resilience," said Luke Kehoe, industry analyst at Ookla. "Many organizations still concentrate critical workloads in a single cloud region. Distributing critical apps and data across multiple regions and availability zones can materially reduce the blast radius of future incidents."

Was the AWS outage caused by a cyberattack?

DNS issues can be caused by malicious actors, but there's no evidence at this stage to say that this is the case for the AWS outage.

Technical faults can, however, pave the way for hackers to look for and exploit vulnerabilities when companies' backs are turned and defenses are down, according to Marijus Briedis, CTO at NordVPN. "This is a cybersecurity issue as much as a technical one," he said in a statement. "True online security isn't only about keeping hackers out, it's also about ensuring you can stay connected and protected when systems fail."

When such an outage happens, people should look out for scammers hoping to take advantage of people's awareness of the outage, Briedis added. You should be extra wary of phishing attacks and emails telling you to change your password to protect your account.
