Our agent found a bug with WireGuard in Google Kubernetes Engine

Why This Matters

This discovery shows how a concurrency bug in the WireGuard integration within Google Kubernetes Engine's networking layer can cause recurring crashes that surface as user-facing failures. It underscores the importance of testing and monitoring the open-source components that are integrated into critical cloud services, and it shows how AI agents can help engineers sift through large volumes of logs while debugging. Fixing such issues matters for the reliability of cloud-based applications, for providers and their customers alike.

The Scent

Last week, our users started seeing errors that didn't make sense. Sometimes opening a project would fail. Sometimes cloning code from GitHub would time out. We were even seeing the dreaded "Connection reset by peer". There was no real obvious pattern, which is always the worst kind of pattern.

On a platform like Lovable, which currently creates more than 50 sandboxes per second during peak hours, even a small percentage of failures can be a big problem for our users. Something in our infrastructure was wobbling, and we needed to find it.

Following the Trail

Sascha, one of our infrastructure engineers, started where any good debugging session begins: the logs. But we had millions of log lines to sift through, and patterns weren't jumping out. He decided to try something new. He'd been experimenting with AI agents for debugging, and this felt like the right moment to lean on them. He set up an agent with access to our ClickHouse logs and started asking it questions. The agent surfaced something suspicious: the anetd pods in our Google Kubernetes Engine cluster were restarting constantly, around 120 restarts per pod over six days, which is almost one crash per hour. Surely, this couldn't be right!

For context, anetd is Google's implementation of Cilium, the networking layer inside our Kubernetes clusters. When anetd crashes, new pods can't get network interfaces. And when your entire product depends on spinning up fresh sandboxes continuously, networking instability quickly translates into user-facing failures.

Sascha dug into the crash dumps. The stack trace pointed to a concurrent map-access panic: multiple goroutines were reading and writing the same data structure at the same time without proper locking. But the key detail was where the panic happened: inside the WireGuard module of anetd.

WireGuard itself is an open-source encryption protocol, which Google does not own. Google does, however, own the code that integrates it into anetd, its networking daemon for GKE. The panic was happening in that integration code, specifically in how it managed concurrent access to a map data structure that tracked WireGuard connections.

This matters because it means the bug was in Google's implementation, not in WireGuard itself. Ergo, we'd need Google's help to fix it.

Pulling in Support

... continue reading