Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future

November 26, 2025 by Eduardo Bellani
Yet again, another global IT outage has happened (déjà vu strikes again in our industry). This time at Cloudflare (Prince 2025). Again, taking down large swaths of the internet with it (Booth 2025).
And yes, like my previous analyses of the GCP and CrowdStrike outages, this post critiques Cloudflare's root cause analysis (RCA), which, despite providing a great overview of what happened, misses the real lesson.
Here’s the key section of their RCA:
Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the "default" database:

    SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;

Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning "duplicates" of columns because those were for underlying tables stored in the r0 database.

This, unfortunately, was the type of query that was performed by the Bot Management feature file generation logic to construct each input "feature" for the file mentioned at the beginning of this section. The query above would return a table of columns like the one displayed (simplified example).

However, as part of the additional permissions that were granted to the user, the response now contained all the metadata of the r0 schema effectively more than doubling the rows in the response ultimately affecting the number of rows (i.e. features) in the final file output.
A central database query didn't have the right constraints to express the business rules. Not only did it miss a filter on the database name, it also clearly needed a DISTINCT and a LIMIT, since uniqueness and a bounded row count seem to be crucial business rules here.
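A minimal sketch of what that query could look like with those rules made explicit. The 'default' database filter comes straight from the RCA quote above; the LIMIT value is a hypothetical cap, standing in for whatever the real maximum number of features is:

    -- Sketch only: the same metadata query with the business rules spelled out.
    -- 'default' is the database named in the RCA; 200 is a made-up cap standing
    -- in for the real feature budget.
    SELECT DISTINCT name, type
    FROM system.columns
    WHERE database = 'default'
      AND table = 'http_requests_features'
    ORDER BY name
    LIMIT 200;

With the database filter and the DISTINCT, the duplicated r0 rows never reach the application; the LIMIT turns the maximum feature count into a property of the query instead of an unstated assumption.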
So, new underlying security work manifested the (unintended) potential that was already there in the query. Since the result was by definition unintended, the application code didn't expect the value it received and reacted poorly, causing a crash loop across seemingly all of Cloudflare's core systems. The bug wasn't caught during rollout because the faulty code path required data that was assumed to be impossible to generate.
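A related sketch, assuming the metadata access can be wrapped in a plain database view (the view name is made up for illustration): if the database itself owns the query, no caller can quietly drop the constraints and resurrect the "impossible" data.

    -- Sketch only: expose the column metadata through a view that bakes in the
    -- business rules, so the database filter and the DISTINCT live in one place.
    -- The view name is hypothetical.
    CREATE VIEW bot_management_feature_columns AS
    SELECT DISTINCT name, type
    FROM system.columns
    WHERE database = 'default'
      AND table = 'http_requests_features';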
Sound familiar? It should. Any senior engineer has seen this pattern before: it is a classic database/application mismatch. With this in mind, let's review how Cloudflare plans to prevent it from happening again:
Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
Enabling more global kill switches for features