
K8s with 1M nodes


It doesn’t do any good to have a 1-million-node cluster if you can’t schedule pods on it. The Kubernetes scheduler is a common bottleneck for large jobs. I ran a benchmark scheduling 50K pods on 50K nodes, and it took about 4.5 minutes. That’s already uncomfortably long.

If you’re creating pods through a replication controller like a Deployment, DaemonSet, or StatefulSet, the controller itself can be a bottleneck even before the scheduler. The DaemonSet controller creates a burst of 500 pods at a time and then waits for the Watch stream to show that those are created before proceeding (the rate depends on many factors, but expect <5K/sec). The scheduler doesn’t even get a chance to run until those pods are created.
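The burst-then-wait pattern above can be sketched roughly like this. This is a simplified illustration, not the real controller code: `createPod` and `waitObserved` are hypothetical stand-ins for the API-server call and the Watch stream, and the real controller issues its creates concurrently.

```go
package main

import "fmt"

// burstSize mirrors the 500-pod burst described above.
const burstSize = 500

// createPodsInBursts issues pods in bursts, blocking after each burst
// until the watch stream has confirmed everything created so far.
func createPodsInBursts(total int, createPod func(i int), waitObserved func(created int)) {
	created := 0
	for created < total {
		n := burstSize
		if remaining := total - created; remaining < n {
			n = remaining
		}
		for i := 0; i < n; i++ {
			createPod(created + i) // one burst of create calls
		}
		created += n
		// Don't start the next burst until the watch stream reflects
		// all pods created so far.
		waitObserved(created)
	}
}

func main() {
	creates, waits := 0, 0
	createPodsInBursts(1200,
		func(int) { creates++ },
		func(int) { waits++ },
	)
	fmt.Println(creates, waits) // 1200 creates across 3 bursts
}
```

The key point is the synchronization: each wait on the watch stream serializes the bursts, which is why pod creation tops out well below what the API server could otherwise absorb.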

For this 1-million-node cluster project, I set an ambitious goal: schedule 1 million pods in 1 minute. Admittedly the number is somewhat arbitrary, but the symmetry of all those m's seemed nice.

I also wanted to keep full compatibility with the standard kube-scheduler. It would be far easier to write a simplified scheduler from scratch that scales impressively in narrow scenarios but then fails spectacularly in real-world use cases. There’s a lot of complexity in the existing scheduler that arises from being battle-tested across lots of different production environments. Stripping away those pesky features to make a “faster” scheduler would be misleading.

So, we’re going to preserve the functionality and implementation of the kube-scheduler as much as we can. What’s getting in the way of making it more scalable?

kube-scheduler works by keeping the state of all nodes in memory, and then runs an O(n·p) loop: each of the p pods is evaluated against every one of the n nodes. First it filters out nodes where the pod wouldn’t fit at all. Then, for each remaining node, it calculates a score for how well that node matches the pod. The pod is scheduled to the highest-scoring node, or to a random choice among the highest-scoring nodes if there’s a tie.
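Stripped of plugins and extension points, the per-pod loop looks roughly like this. This is a minimal sketch with made-up types: `fits` stands in for the filter plugins and `score` for the scoring plugins (here, just favoring the most free CPU), not the real kube-scheduler API.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical, simplified stand-ins for the scheduler's node and pod state.
type Node struct {
	Name         string
	FreeMilliCPU int64
}

type Pod struct {
	Name     string
	MilliCPU int64
}

// fits plays the role of the filter plugins: can this pod run here at all?
func fits(p Pod, n Node) bool {
	return n.FreeMilliCPU >= p.MilliCPU
}

// score plays the role of the scoring plugins: prefer nodes with the
// most CPU left over after placing the pod.
func score(p Pod, n Node) int64 {
	return n.FreeMilliCPU - p.MilliCPU
}

// schedule does the O(n) work per pod: filter every node, score the
// survivors, and pick randomly among the top scorers.
func schedule(p Pod, nodes []Node) (Node, bool) {
	var best []Node
	var bestScore int64 = -1
	for _, n := range nodes {
		if !fits(p, n) {
			continue
		}
		if s := score(p, n); s > bestScore {
			bestScore, best = s, []Node{n}
		} else if s == bestScore {
			best = append(best, n)
		}
	}
	if len(best) == 0 {
		return Node{}, false // unschedulable
	}
	return best[rand.Intn(len(best))], true
}

func main() {
	nodes := []Node{{"a", 500}, {"b", 2000}, {"c", 1000}}
	n, ok := schedule(Pod{"p", 400}, nodes)
	fmt.Println(n.Name, ok) // b true: "b" has the most CPU left over
}
```

Even in this toy form, the cost structure is visible: every pod touches every node, which is exactly the n×p product that hurts at a million nodes.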

It parallelizes the filtering of ineligible nodes, as well as the scoring of nodes against a particular pod.

When there is a large number of eligible nodes, it only scores a fraction of them, down to 5% for large clusters.
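The sampling logic is roughly the following, paraphrased from kube-scheduler’s `numFeasibleNodesToFind`; the constants match upstream defaults as I understand them, but may differ across Kubernetes versions.

```go
package main

import "fmt"

const (
	// Always consider at least this many nodes outright.
	minFeasibleNodesToFind = 100
	// Floor on the sampling percentage for very large clusters.
	minFeasibleNodesPercentageToFind = 5
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler
// looks for before it stops filtering and moves on to scoring.
func numFeasibleNodesToFind(numAllNodes int32) int32 {
	if numAllNodes <= minFeasibleNodesToFind {
		return numAllNodes
	}
	// The percentage shrinks linearly with cluster size, from 50%
	// down to the 5% floor (reached around 5,625 nodes).
	adaptivePercentage := int32(50) - numAllNodes/125
	if adaptivePercentage < minFeasibleNodesPercentageToFind {
		adaptivePercentage = minFeasibleNodesPercentageToFind
	}
	numNodes := numAllNodes * adaptivePercentage / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(5000))    // 10% of 5,000 → 500
	fmt.Println(numFeasibleNodesToFind(1000000)) // 5% floor → 50,000
}
```

So at a million nodes the scheduler only filters until it finds 50K feasible nodes per pod, which caps the scoring work but still leaves a lot of per-pod filtering.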

This is parallelizable, and to be fair, the scheduler does parallelize the filtering and scoring of nodes for a given pod. But it is still burdened with doing that work across all nodes. And this isn’t just parallelizable: it can also be distributed.

Basic design: shard on nodes
