
Incident Report: May 19, 2026 – GCP Account Suspension

Why This Matters

The incident highlights the critical dependence of cloud-based services on platform stability, emphasizing the importance of robust incident management and contingency planning for tech companies. It also underscores the potential ripple effects of cloud service disruptions on end-user experience and business operations, prompting both providers and consumers to prioritize resilience strategies.

Key Takeaways

🚅 This report reflects what we know at the time of publication and may be updated pending Google Cloud's internal review.

Railway experienced a platform-wide service disruption after Google Cloud incorrectly placed our account in a suspended status. This caused a temporary loss of service for all GCP-hosted infrastructure, which supports our dashboard, API, and parts of our network infrastructure. As cached network routes expired, the outage extended beyond GCP to affect all Railway workloads.

Below, we walk through what happened, how we responded, and what we're doing to prevent a similar incident in the future.

Between 22:20 UTC on May 19, 2026 and approximately 06:14 UTC on May 20 (~8 hours), Railway experienced a platform-wide outage after Google Cloud suspended services on our production account. This took our API, control plane, and databases offline, along with compute infrastructure hosted on Google Cloud.

Users immediately experienced 503 errors on the dashboard and API, including "no healthy upstream" and "unconditional drop overload" messages, and were unable to log in. All workloads hosted on Google Cloud compute were taken offline.

Workloads on our own Railway Metal and AWS burst-cloud environments remained up. However, Railway's edge proxies rely on a Google Cloud-hosted control plane API to populate their routing tables, so the outage cascaded beyond Google Cloud. As the route caches expired, these other workloads became unreachable: requests to them returned 404 errors because the network control plane could no longer resolve routes to active instances. At peak impact, all Railway workloads across all regions were unreachable.
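To make the cascade concrete, here is a minimal sketch of a TTL-based route cache in front of a control plane lookup. It is an illustration only, not Railway's implementation: `RouteCache`, `ControlPlaneUnavailable`, the `fetch_route` callback, and the `serve_stale` flag are all hypothetical names introduced for this example. With `serve_stale=False` it reproduces the behavior described above, where a healthy workload starts returning 404 once its cached route expires and the control plane cannot be reached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane client when it is unreachable."""


@dataclass
class CachedRoute:
    upstream: str
    expires_at: float


class RouteCache:
    """Hypothetical edge-proxy route cache; not Railway's actual code."""

    def __init__(
        self,
        fetch_route: Callable[[str], Optional[str]],
        ttl_seconds: float,
        serve_stale: bool = False,
    ) -> None:
        self._fetch_route = fetch_route
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale
        self._entries: Dict[str, CachedRoute] = {}

    def resolve(self, host: str, now: float) -> Tuple[int, Optional[str]]:
        """Return (status, upstream) for a host at time `now` (in seconds)."""
        entry = self._entries.get(host)
        if entry is not None and now < entry.expires_at:
            return 200, entry.upstream  # fresh cache hit

        try:
            upstream = self._fetch_route(host)
        except ControlPlaneUnavailable:
            # Assumption: an optional fallback to the last known route.
            if self._serve_stale and entry is not None:
                return 200, entry.upstream
            return 404, None  # the failure mode described in the report

        if upstream is None:
            return 404, None  # the control plane knows no such route
        self._entries[host] = CachedRoute(upstream, now + self._ttl)
        return 200, upstream
```

In this sketch a request succeeds while the cached entry is fresh, even with the control plane down, and fails as soon as the TTL passes. Setting `serve_stale=True` keeps the last known route alive, at the cost of possibly routing to an instance that no longer exists; whether that trade-off is acceptable depends on the platform.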

As we recovered our Google Cloud environment, builds and deployments were blocked platform-wide while we restored the individual services. Once our entire infrastructure was restored, we drained a significant backlog of queued deploys gradually to avoid overwhelming the platform. In parallel, GitHub began rate-limiting Railway's OAuth and webhook integrations, temporarily blocking logins and builds. The volume of these calls had increased because our caches were cleared during the Google Cloud outage. As a side effect, terms-of-service acceptance records were also reset, prompting users to re-accept on their next visit to the dashboard.
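Draining a backlog gradually can be sketched as releasing queued work in bounded batches rather than all at once. The helper below is a hypothetical illustration, not Railway's deploy scheduler: `drain_in_batches`, `max_per_tick`, and the string deploy IDs are assumptions made for the example, and the pause between ticks is left to the caller.

```python
from collections import deque
from typing import Deque, Iterator, List


def drain_in_batches(backlog: Deque[str], max_per_tick: int) -> Iterator[List[str]]:
    """Yield queued deploy IDs in FIFO order, at most `max_per_tick` at a time.

    The caller decides how long to wait between batches, for example by
    sleeping or by waiting until the previous batch has finished.
    """
    if max_per_tick < 1:
        raise ValueError("max_per_tick must be at least 1")
    while backlog:
        size = min(max_per_tick, len(backlog))
        yield [backlog.popleft() for _ in range(size)]


if __name__ == "__main__":
    queued = deque(f"deploy-{i}" for i in range(1, 6))
    for batch in drain_in_batches(queued, max_per_tick=2):
        print(batch)
```

Capping the batch size bounds the load that recovery places on the platform, which also helps when downstream calls to a third-party API are subject to rate limits.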

We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage, and detail below what happened, how we recovered, and the changes we are making to prevent this from happening again.

May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-call engineers, who began investigating.

May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
