
How We Found 7 TiB of Memory Just Sitting Around


“Debugging infrastructure at scale is rarely about one big aha moment. It’s often the result of many small questions, small changes, and small wins stacked up until something clicks.”

Inside the hypercube of bad vibes: the namespace dimension

Getting ready to dissect what I like to call the Kubernetes hypercube of bad vibes. Credits: Hyperkube from gregegan.net, diagram (modified) from Kubernetes community repo.

Plenty of teams run Kubernetes clusters bigger than ours. More nodes, more pods, more ingresses, you name it. In most dimensions, someone out there has us beat.

There's one dimension where I suspect we might be near the very top: namespaces. I say that because we keep running into odd behavior in any process that has to keep track of them. In particular, anything that listwatches them ends up using a surprising amount of memory and puts real pressure on the apiserver. This has become one of those scaling quirks you only really notice once you hit a certain threshold. As this memory overhead adds up, efficiency decreases: each byte we have to use for management is a byte we can't put towards user services.
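To make that overhead concrete, here is a minimal sketch (not our production code) of what such a process typically looks like with client-go: a shared informer lists and then watches Namespaces and mirrors every object into its own in-memory store. The package layout and printouts are placeholders; the point is that each process doing this holds a full local copy of every namespace.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the process runs in-cluster with RBAC to list and watch namespaces.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A shared informer performs one LIST followed by a WATCH and mirrors every
	// Namespace object into an in-memory store. With N namespaces, every process
	// that does this keeps its own N decoded objects resident.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nsInformer := factory.Core().V1().Namespaces().Informer()

	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ns := obj.(*corev1.Namespace)
			fmt.Println("namespace added:", ns.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// The local cache now contains every namespace in the cluster.
	fmt.Println("cached namespaces:", len(nsInformer.GetStore().List()))

	<-stop // block forever; a real daemon would wire this to shutdown signals
}
```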

The problem gets significantly worse when a daemonset needs to listwatch namespaces or network policies (netpols, which we define per namespace). Since daemonsets run a pod on every node, each of those pods independently listwatches the same resources and keeps its own copy. As a result, the cluster-wide memory cost scales with the number of nodes multiplied by the number of namespaces.
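A rough back-of-envelope shows why that multiplication hurts. The numbers below are purely illustrative, not measurements from our clusters:

```go
// Package estimate: purely illustrative numbers modelling what it costs when
// every daemonset pod caches every namespace. None of these values are
// measurements from the clusters described in this post.
package estimate

const (
	nodes          = 1_000   // daemonset pods, one per node (hypothetical)
	namespaces     = 50_000  // hypothetical namespace count
	bytesPerObject = 2 << 10 // ~2 KiB per decoded Namespace object, a guess
)

// Each pod holds its own full copy, so the cluster-wide cost is multiplicative:
// 1,000 × 50,000 × 2 KiB ≈ 95 GiB of memory just to mirror namespace metadata.
const totalBytes = nodes * namespaces * bytesPerObject
```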

Even worse, these listwatch calls can put significant load on the apiserver. If many daemonset pods restart at once, such as during a rollout, they can overwhelm the server with requests and cause real disruption.
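For context, this is roughly the list-then-watch cycle each restarted pod runs through, shown here with raw client-go calls rather than an informer and heavily simplified (the function and package names are mine). The initial LIST returns every namespace, so many pods restarting together means many full LISTs hitting the apiserver at the same time.

```go
package watcher

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchNamespaces sketches the list-then-watch cycle a freshly started pod
// performs. A real client would use an informer with relist and backoff logic,
// but the apiserver cost of the initial full LIST is the same.
func watchNamespaces(ctx context.Context, client kubernetes.Interface) error {
	// The initial LIST returns every namespace in the cluster.
	list, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}

	// Resume watching from the resourceVersion the LIST returned, so no events
	// are missed between the two calls.
	w, err := client.CoreV1().Namespaces().Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		_ = event // handle Added/Modified/Deleted events here
	}
	return nil
}
```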

Following the memory trail

A few months ago, if you looked at our nodes, the largest memory consumers were often daemonsets. In particular, Calico and Vector, which handle network configuration and log collection, respectively.

We had already done some work to reduce Calico’s memory usage, working closely with the project’s maintainers to make it scale more efficiently. That optimization effort was a big win for us, and it gave us useful insight into how memory behaves when namespaces scale up.
