Google links massive cloud outage to API management issue

Google says an API management issue is behind Thursday's massive Google Cloud outage, which disrupted or brought down its services and many other online platforms.

Google says the cloud outage started around 10:49 ET and ended at 3:49 ET, after causing issues for millions of users worldwide for over three hours.

Besides Google Cloud, the incident also impacted Gmail, Google Calendar, Google Chat, Google Cloud Search, Google Docs, Google Drive, Google Meet, Google Tasks, Google Voice, Google Lens, Discover, and Voice Search.

The outage also caused widespread issues for third-party platforms that rely on Google Cloud, including Spotify, Discord, Snapchat, NPM, Firebase Studio, and a limited number of Cloudflare services relying on the Workers KV key-value store.

"We are deeply sorry for the impact to all of our users and their customers that this service disruption/outage caused. Businesses large and small trust Google Cloud with your workloads and we will do better," Google said.

While it's still working on publishing a full incident report, Google revealed today the root cause of the increased number of 503 errors in external API requests during yesterday's three-hour-long outage.

As the company explained today, its Google Cloud API management platform failed due to invalid data, an issue that wasn't discovered and remediated promptly because the platform lacked effective testing and error-handling systems.

"From our initial analysis, the issue occurred due to an invalid automated quota update to our API management system which was distributed globally, causing external API requests to be rejected. To recover we bypassed the offending quota check, which allowed recovery in most regions within 2 hours," the company added.

"However, the quota policy database in us-central1 became overloaded, resulting in much longer recovery in that region. Several products had moderate residual impact (e.g. backlogs) for up to an hour after the primary issue was mitigated and a small number recovering after that."
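To illustrate the general failure mode described above, here is a minimal sketch of a quota check that validates policy data before using it and lets requests through when the policy itself is invalid. This is not Google's implementation: the `QuotaPolicy` record, its field names, and the `is_valid` and `check_quota` functions are all hypothetical, and the choice to fail open is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; field names are illustrative only.
    project: str
    limit_per_minute: Optional[int]


def is_valid(policy: QuotaPolicy) -> bool:
    """Reject policies with blank or nonsensical fields before they are used."""
    return (
        bool(policy.project)
        and policy.limit_per_minute is not None
        and policy.limit_per_minute > 0
    )


def check_quota(policy: QuotaPolicy, used_this_minute: int) -> int:
    """Return an HTTP-style status code for one external API request."""
    if not is_valid(policy):
        # Fail open: invalid policy data should not reject every request.
        return 200
    if used_this_minute >= policy.limit_per_minute:
        return 429  # over quota
    return 200
```

With this sketch, a policy whose limit is missing no longer blocks traffic, while a valid policy still enforces its limit. Whether failing open is acceptable depends on what the quota protects, so it is a design choice rather than a universal rule.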

Cloudflare services taken down by Google's outage