
Massive Cloudflare outage was triggered by file that suddenly doubled in size


When a Cloudflare outage disrupted large numbers of websites and online services yesterday, the company initially thought it had been hit by a “hyper-scale” DDoS (Distributed Denial-of-Service) attack.

“I worry this is the big botnet flexing,” Cloudflare co-founder and CEO Matthew Prince wrote in an internal chat room as he and others discussed whether Cloudflare was under attack from the prolific Aisuru botnet. But upon further investigation, Cloudflare staff realized the problem had an internal cause: an important file had unexpectedly doubled in size and propagated across the network.

This broke the software that reads the file to keep Cloudflare’s Bot Management system up to date; that system uses a machine learning model to protect against security threats. Cloudflare’s core CDN, security services, and several other services were affected.

“After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file,” Prince wrote in a post-mortem of the outage.

Prince explained that the problem “was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.”
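To make the mechanism concrete, here is a minimal Python sketch of how repeated query results can double a generated file. Everything in it is hypothetical: the row contents, the function names, and the assumption that each entry came back exactly twice are illustrative only and are not taken from Cloudflare's systems.

```python
# Hypothetical illustration only: the rows and names below are invented,
# not Cloudflare's actual schema or code.

def build_feature_file(rows):
    """Serialize one line per metadata row, as a naive generator might."""
    return "\n".join(f"{name}:{kind}" for name, kind in rows)


def dedupe(rows):
    """Drop repeated rows while preserving their original order."""
    return list(dict.fromkeys(rows))


base_rows = [
    ("feature_a", "Float64"),
    ("feature_b", "Float64"),
    ("feature_c", "Int32"),
]

# Assumption: after the permissions change, every row is returned twice.
duplicated_rows = base_rows + base_rows

normal = build_feature_file(base_rows)
bloated = build_feature_file(duplicated_rows)
cleaned = build_feature_file(dedupe(duplicated_rows))
```

In this sketch, `bloated` has twice as many lines as `normal`, while `cleaned` is identical to `normal`. Deduplicating before serialization is one possible guard, not a description of the fix Cloudflare applied.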

These machines run software that routes traffic across the Cloudflare network. The software “reads this feature file to keep our Bot Management system up to date with ever changing threats,” Prince wrote. “The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.”
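The failure mode Prince describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cloudflare's code: the limit value, the function names, and the choice to count entries rather than bytes are all hypothetical. The fallback to the last known-good file is a defensive pattern shown for contrast; the source says only that the software failed when the limit was exceeded.

```python
# Hypothetical sketch: names and the limit value are assumptions.
MAX_FEATURES = 4  # assumed cap on the number of entries in a feature file


class FeatureFileTooLarge(Exception):
    """Raised when a feature file exceeds the configured limit."""


def parse_features(text, max_features=MAX_FEATURES):
    """Parse one feature per non-empty line, enforcing the size limit."""
    features = [line for line in text.splitlines() if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


def load_with_fallback(new_text, last_good):
    """Use the new file if it is valid; otherwise keep the previous one."""
    try:
        return parse_features(new_text)
    except FeatureFileTooLarge:
        # Reject the oversized file instead of letting the error propagate.
        return last_good


last_good = parse_features("a\nb\nc")
doubled = "a\nb\nc\na\nb\nc"  # the same entries, repeated
active = load_with_fallback(doubled, last_good)
```

Calling `parse_features(doubled)` directly raises `FeatureFileTooLarge`, which mirrors a hard failure on an oversized file. With `load_with_fallback`, the process keeps serving with the earlier three-entry configuration.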

Sorry for the pain, Internet

After replacing the bloated feature file with an earlier version, the flow of core traffic “largely” returned to normal, Prince wrote. But it took another two-and-a-half hours “to mitigate increased load on various parts of our network as traffic rushed back online.”