
Cloudflare blames this week's massive outage on database issues


On Tuesday, Cloudflare experienced its worst outage in six years, blocking access to many websites and online platforms for almost six hours after a change to database access controls triggered a cascading failure across its Global Network.

The company's Global Network is a distributed infrastructure of servers and data centers in more than 120 countries that provides content delivery, security, and performance optimization services. It connects Cloudflare to over 13,000 networks, including every major ISP, cloud provider, and enterprise worldwide.

Matthew Prince, the company's CEO, said in a post-mortem published after the outage was mitigated that the service disruptions were not caused by a cyberattack.

"The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a "feature file" used by our Bot Management system," Prince said.

The outage began at 11:28 UTC, when a routine database permissions update caused Cloudflare's Bot Management system to generate an oversized configuration file containing duplicate entries. The file exceeded the system's built-in size limits and caused the software that routes traffic across Cloudflare's network to crash.

The underlying database query returned duplicate column metadata after the permissions change, inflating the feature file from approximately 60 features to more than 200 and exceeding the system's hardcoded 200-feature limit, a cap designed to prevent unbounded memory consumption.
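The cap itself is a simple guardrail. A minimal Rust sketch, with hypothetical names rather than Cloudflare's actual code, shows the pattern: preallocate space for a fixed number of features and reject any file that exceeds it, leaving the caller to decide what to do with the error.

```rust
// Hypothetical sketch of a feature-file loader with a hardcoded cap, along the
// lines of the limit described in the article. Names, types, and the error
// handling are illustrative, not Cloudflare's actual implementation.

const MAX_FEATURES: usize = 200; // hardcoded cap to bound memory use

#[derive(Debug)]
struct FeatureConfigError(String);

fn load_feature_file(entries: &[String]) -> Result<Vec<String>, FeatureConfigError> {
    if entries.len() > MAX_FEATURES {
        // An oversized file (e.g. one inflated by duplicate entries) is rejected
        // here; how the caller handles this error decides what happens to traffic.
        return Err(FeatureConfigError(format!(
            "feature file has {} entries, limit is {}",
            entries.len(),
            MAX_FEATURES
        )));
    }
    let mut features = Vec::with_capacity(MAX_FEATURES); // memory bounded up front
    features.extend_from_slice(entries);
    Ok(features)
}

fn main() {
    // Roughly 60 unique features duplicated past the 200-entry cap, as in the outage.
    let duplicated: Vec<String> = (0..240).map(|i| format!("feature_{}", i % 60)).collect();

    match load_feature_file(&duplicated) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("rejected config: {:?}", e),
    }
}
```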

5xx error HTTP status codes during outage (Cloudflare)

Every five minutes, the query generated either a correct or a faulty configuration file, depending on which cluster nodes had been updated, causing the network to fluctuate between working and failing states.
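A small simulation, again with hypothetical node names and a made-up rotation rather than Cloudflare's actual rollout mechanics, illustrates that flapping: only the nodes that have already received the permissions change emit the oversized file.

```rust
// Hypothetical simulation of the flapping described in the article: every
// refresh pulls the feature file from a cluster node, and only nodes that have
// already received the permissions change emit the oversized file. Node names,
// counts, and the rotation are illustrative.

const FEATURE_LIMIT: usize = 200;

struct ClusterNode {
    name: &'static str,
    updated: bool, // has the permissions change reached this node yet?
}

impl ClusterNode {
    fn generate_feature_file(&self) -> Vec<String> {
        let unique = 60;
        // Updated nodes return duplicate column metadata, inflating the file.
        let total = if self.updated { unique * 4 } else { unique };
        (0..total).map(|i| format!("feature_{}", i % unique)).collect()
    }
}

fn main() {
    let nodes = [
        ClusterNode { name: "node-a", updated: false },
        ClusterNode { name: "node-b", updated: true },
        ClusterNode { name: "node-c", updated: false },
    ];

    // Each simulated five-minute cycle happens to hit a different node, so the
    // output alternates between a valid file and one that exceeds the limit.
    for (cycle, node) in nodes.iter().cycle().take(6).enumerate() {
        let file = node.generate_feature_file();
        let status = if file.len() > FEATURE_LIMIT { "FAULTY" } else { "ok" };
        println!("cycle {cycle}: {} -> {} entries ({status})", node.name, file.len());
    }
}
```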

Additionally, when the oversized file propagated across the network's machines, the Bot Management module's Rust code panicked, crashing the core proxy system that handles traffic processing and returning 5xx errors to clients.
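In Rust, a panic of this kind typically happens when an error path assumed to be impossible is hit, for example by calling .unwrap() on a Result. The sketch below assumes that failure mode, since the article only says the module panicked:

```rust
// Minimal sketch of how an unexpected configuration error can escalate into a
// Rust panic. The article states only that the Bot Management module's Rust
// code panicked on the oversized file; the use of .unwrap() here is an
// assumption, and the names are hypothetical.

#[derive(Debug)]
struct FeatureConfigError(String);

fn parse_feature_file(entry_count: usize, limit: usize) -> Result<usize, FeatureConfigError> {
    if entry_count > limit {
        return Err(FeatureConfigError(format!(
            "{entry_count} entries exceeds the {limit}-feature limit"
        )));
    }
    Ok(entry_count)
}

fn main() {
    // Treating the error path as impossible: .unwrap() panics when the oversized
    // file arrives. In a proxy worker, an unhandled panic like this is what would
    // surface to clients as 5xx responses.
    let features = parse_feature_file(240, 200).unwrap();
    println!("loaded {features} features");
}
```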

Core traffic returned to normal by 14:30 UTC after Cloudflare engineers identified the root cause and replaced the problematic file with an earlier version. All systems were fully operational by 17:06 UTC. The outage affected Cloudflare's core CDN and security services, Turnstile, Workers KV, dashboard access, email security, and access authentication.
