Massive Azure outage is over, but problems linger - here's what happened

ZDNET's key takeaways

Microsoft Azure experienced a global outage on October 29.
Microsoft customer-facing services were affected.
Recovery came later that same day, but some problems linger.

Last week, Amazon Web Services (AWS) went down, and many of us were miserable. This week, it's Microsoft Azure's turn to fall down and go boom, and once more, we're pretty darn unhappy about it.

Microsoft states that the latest Azure outage began at approximately noon ET on October 29. However, Downdetector, which relies on user reports, shows the problems surfaced earlier, around 11:40 a.m.

ThousandEyes, Cisco's network monitoring company, "detected HTTP timeouts, server error codes, and elevated packet loss at the edge of Microsoft's network, preventing successful connections to affected services and frequently timing out or returning service-related errors."

The latest status update

As of 5:30 p.m. ET, October 29, Microsoft reported, "We initiated the deployment of our 'last known good' configuration, which has now successfully completed. We are currently recovering nodes and re-routing traffic through healthy nodes."

Don't get too excited, though. We're not done yet. Microsoft continued, "As recovery progresses, some requests may still land on unhealthy nodes, resulting in intermittent failures or reduced availability until more nodes are fully restored. This recovery effort involves reloading configurations and rebalancing traffic across a large volume of nodes to restore full operational scale. The process is gradual by design, ensuring stability and preventing overload as dependent services recover. We expect continued improvement across affected regions. This means we expect recovery to happen by 23:20 UTC on 29 October 2025." That's 7:20 p.m. ET.
In reality, it took a bit longer. Azure reported that it was back to normal by 8:05 p.m. yesterday. Even then, Microsoft warned that customer configuration changes to Azure Front Door (AFD) would remain temporarily blocked. Microsoft promised it would notify customers once this block has been lifted. In addition, while "error rates and latency are back to pre-incident levels, a small number of customers may still be seeing issues, and we are still working to mitigate this long tail."

If you're still having trouble today, talk to Azure. If things are really fouled up, Microsoft recommends you consider implementing existing failover strategies, using Azure Traffic Manager to redirect traffic from Azure Front Door to your origin servers as an interim measure. This is far from an easy fix. If your staff isn't experienced with Azure traffic routing, I'd grit my teeth and wait for Azure to come completely back online.

The AWS failure, while massive in its damage, was limited to a single region (US-EAST-1). The Azure outage was not: according to the Azure Status page as of 1:30 p.m. ET, all Azure regions were down.

Tracing the faulty deployment

We still don't have a final report on what happened. At first, Microsoft only said, "Starting at approximately 16:00 UTC, we began experiencing Azure Front Door (AFD) issues resulting in a loss of availability of some services. We suspect that an inadvertent configuration change was the trigger event for this issue. We are taking two concurrent actions where we are blocking all changes to the AFD services and, at the same time, rolling back to our last known good state."

Microsoft's initial report on the incident stated, "An inadvertent tenant configuration change within AFD triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery."
The change caused an invalid configuration state, which, in turn, left a significant number of AFD nodes unable to load properly. The result was increased latencies, timeouts, and connection errors for downstream services. In other words, it was a complete mess.

As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even in partially healthy regions.

Microsoft immediately "blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a 'last known good' configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue."

The problem has been traced back to a faulty tenant configuration deployment process. "Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect that allowed the deployment to bypass safety validations. Safeguards have since been reviewed, and additional validation and rollback controls have been immediately implemented to prevent similar issues in the future."

Although it's not mentioned in Microsoft's report, early Azure reports put some of the blame on -- you guessed it! -- a Domain Name System (DNS) problem. Say it with me: When there's a network problem, "It's always DNS!"

Which sites and services were affected?

Ordinary people felt the pain as well. Popular services such as Microsoft 365 and Microsoft Intune for business users, and Xbox Live and Minecraft for people just wanting to have fun, were also down.
Others reported that Microsoft logins were also slowing to a crawl or failing entirely. The following services were affected: Microsoft 365, Microsoft Azure, Microsoft Copilot, Microsoft Entra, Microsoft Store, Microsoft Teams, Minecraft, and Xbox.

It was a bad day if you relied on Microsoft. Alaska Airlines suffered interruptions to its critical internal systems, including its website and operational infrastructure. Vodafone in the UK and Heathrow Airport were also reported to have been affected by the outage.

Behind the scenes, Microsoft now reports that the following Azure services were affected: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Container Registry, Media Services, Microsoft Defender External Attack Surface Management, Microsoft Entra ID, Microsoft Purview, Microsoft Sentinel, Video Indexer, and Virtual Desktop.

Earlier, Ookla telecom analyst Luke Kehoe said, "Microsoft Azure has knocked many services offline worldwide, with a wide blast radius across airlines, banks, and government agencies. It is the second such event this month, highlighting the systemic risks of concentration and single points of logical failure, regardless of how physically hardened the infrastructure is."

He's got a point. We rely too heavily on AWS, Azure, and other cloud services, which, when the going gets tough, turn out to be single points of failure.

Be that as it may, in its latest quarterly report, which came after the bell on the same day, Microsoft reported that it beat Wall Street estimates and that Azure's revenue grew by about 40%. Still, with this ongoing failure and Microsoft admitting that it can't keep up with AI and cloud demands, Microsoft's stock sank lower in after-market trading.
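
For readers wondering what the failover Microsoft recommends looks like in principle, here is a minimal sketch of priority-based failover: prefer the edge service, and fall back to the origin when a health probe fails. This is an illustration only, not Azure Traffic Manager's API. The `Endpoint` class, the `pick_endpoint` function, the example URLs, and the stand-in probe are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    name: str      # hypothetical label, e.g. "front-door" or "origin"
    url: str       # hypothetical address, for illustration only
    priority: int  # lower number = more preferred


def pick_endpoint(
    endpoints: Iterable[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the most-preferred endpoint that passes the health probe."""
    for endpoint in sorted(endpoints, key=lambda e: e.priority):
        if is_healthy(endpoint):
            return endpoint
    return None  # nothing is healthy; the caller decides how to fail


# Illustrative setup: the edge service is preferred, the origin is the fallback.
ENDPOINTS = [
    Endpoint("front-door", "https://edge.example.com", priority=1),
    Endpoint("origin", "https://origin.example.com", priority=2),
]


def probe_during_outage(endpoint: Endpoint) -> bool:
    """Stand-in for a real HTTP health probe: pretend the edge is down."""
    return endpoint.name != "front-door"


chosen = pick_endpoint(ENDPOINTS, probe_during_outage)
print(chosen.name if chosen else "no healthy endpoint")
```

In a real deployment, the probe would be an actual HTTP health check and the switch would happen at the DNS or traffic-routing layer rather than in application code. That is a large part of why this is far from an easy fix for teams without traffic-routing experience.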