The Scent
Last week, our users started seeing errors that didn't make sense. Sometimes opening a project would fail. Sometimes cloning code from GitHub would time out. We were even seeing the dreaded "Connection reset by peer". There was no obvious pattern, which is always the worst kind of pattern.
On a platform like Lovable, which currently creates more than 50 sandboxes per second during peak hours, even a small percentage of failures can be a big problem for our users. Something in our infrastructure was wobbling, and we needed to find it.
Following the Trail
Sascha, one of our infrastructure engineers, started where any good debugging session begins: the logs. But we had millions of log lines to sift through, and patterns weren't jumping out. He decided to try something new. He'd been experimenting with AI agents for debugging, and this felt like the right moment to lean on them. He set up an agent with access to our ClickHouse logs and started asking it questions. The agent surfaced a suspicious issue: the anetd pods in our Google Kubernetes Engine cluster were restarting constantly, around 120 restarts per pod over six days, almost one crash per hour. Surely, this couldn't be right!
For context, anetd is Google's implementation of Cilium, the networking layer inside our Kubernetes clusters. When anetd crashes, new pods can't get network interfaces. And when your entire product depends on spinning up fresh sandboxes continuously, networking instability quickly translates into user-facing failures.
Sascha dug into the crash dumps. The stack trace pointed to a concurrent map-access panic: multiple goroutines trying to read and write the same data structure at the same time without proper locking. But the key detail was where the panic happened: inside the WireGuard module of anetd.
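To illustrate the class of bug (this is a minimal sketch, not anetd's actual code; the `peerMap` type, field names, and values are hypothetical): Go's built-in maps are not safe for concurrent use, and the runtime aborts the whole process with "fatal error: concurrent map read and map write" when it detects unsynchronized access. The standard fix is to guard the map with a mutex, as below.

```go
package main

import (
	"fmt"
	"sync"
)

// peerMap sketches the kind of shared state involved in the anetd crash:
// a map of connection entries read and written by multiple goroutines.
// Accessing a plain map[string]string like this WITHOUT the mutex makes
// the Go runtime kill the process, which is what kept crashing the pod.
type peerMap struct {
	mu    sync.RWMutex
	peers map[string]string
}

func newPeerMap() *peerMap {
	return &peerMap{peers: make(map[string]string)}
}

func (p *peerMap) set(key, val string) {
	p.mu.Lock() // exclusive lock for writes
	defer p.mu.Unlock()
	p.peers[key] = val
}

func (p *peerMap) get(key string) (string, bool) {
	p.mu.RLock() // shared lock: many readers may hold it at once
	defer p.mu.RUnlock()
	v, ok := p.peers[key]
	return v, ok
}

func main() {
	pm := newPeerMap()
	var wg sync.WaitGroup
	// Hammer the map from many goroutines. With the mutex this is safe;
	// without it the runtime would abort the entire process.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := fmt.Sprintf("peer-%d", i%10)
			pm.set(key, "endpoint")
			pm.get(key)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(pm.peers)) // all goroutines done, safe to read: prints 10
}
```

The crash-on-detection behavior is deliberate: Go treats an unsynchronized map as corrupted state and refuses to limp along, which is why a single missed lock in a long-running daemon shows up as a hard restart rather than a subtle data error.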
WireGuard itself is an open-source encryption protocol, which Google does not own. But they do own the code that integrates it into anetd, their networking daemon for GKE. The panic was happening in Google's integration code, specifically in how they were managing concurrent access to a map data structure that tracked WireGuard connections.
This matters because it means the bug was in Google's implementation, not in WireGuard itself. Ergo, we'd need Google's help to fix it.
Pulling in Support
... continue reading