It doesn’t do any good to have a 1-million-node cluster if you can’t schedule pods on it. The Kubernetes scheduler is a pretty common bottleneck for large jobs. I ran a benchmark scheduling 50K pods on 50K nodes, and it took about 4.5 minutes. That’s already uncomfortably long.

If you’re creating pods through a workload controller like a Deployment, DaemonSet, or StatefulSet, the controller itself can be a bottleneck even before the scheduler. The DaemonSet controller, for example, creates a burst of 500 pods at a time and then waits for the Watch stream to show that those are created before proceeding (the rate depends on many factors, but expect <5K/sec). The scheduler doesn’t even get a chance to run until those pods are created.

For this 1-million-node cluster project, I set an ambitious goal of being able to schedule 1 million pods in 1 minute. Admittedly the number is somewhat arbitrary, but the symmetry with all those m’s seemed nice.

I also wanted to keep full compatibility with the standard kube-scheduler. It would be far easier to write a simplified scheduler from scratch that scales impressively in narrow scenarios but then fails spectacularly in real-world use cases. A lot of the complexity in the existing scheduler comes from being battle-tested across lots of different production environments. Stripping away those pesky features to make a “faster” scheduler would be misleading. So we’re going to preserve the functionality and implementation of the kube-scheduler as much as we can.

What’s getting in the way of making it more scalable?

kube-scheduler works by keeping the state of all nodes and running an O(n × p) loop that evaluates each pod against every node. First it filters out the nodes the pod wouldn’t fit on at all. Then, for each remaining node, it calculates a score for how well that node matches the pod. The pod is scheduled to the highest-scoring node, or to a random choice among the highest-scoring nodes if there’s a tie.

To be fair, the scheduler already parallelizes this work: for a given pod, the filtering of ineligible nodes and the scoring of the remaining ones run in parallel, and when there is a large number of eligible nodes it only scores a fraction of them, down to 5% for large clusters. But a single scheduler is still burdened with considering all of the nodes. This work isn’t just parallelizable, it’s also distributable.

Basic design: shard on nodes

It’s akin to the classic scatter/gather design of a distributed search system. Think of each node as a document in the corpus, and each pod as a search query. The query is fanned out to many shards, each responsible for a fraction of the documents. Each shard selects its top candidate(s) and sends them back to a central gatherer to ultimately identify the overall top result.

[Figure: generic scatter-gather design pattern]

The key difference is that in search, documents are read-only, so queries can be evaluated in parallel without conflicts. In scheduling, however, executing a decision actually modifies the documents (i.e. it allocates node resources). If two pods are scheduled in parallel onto the same node, one may succeed while the other must fail due to insufficient resources.

Nevertheless, for a large cluster it’s reasonable to design for “optimistic concurrency”: presume that multiple pods can be scheduled at the same time without conflicts, and only check whether a conflict has arisen before committing. If a conflict does occur, we still do the correct thing by “rolling back” the other pod, i.e. it has to be re-scheduled. But the chance and the resulting impact of this happening are low, low enough that you get much higher throughput by running in parallel and absorbing the wasted effort when it does happen.
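To make the idea concrete, here is a toy sketch of optimistic concurrency in Go. It is not kube-scheduler code: the single-integer “capacity” model and all the names are invented for illustration. Two pods optimistically pick the same node in parallel, and the conflict only surfaces at commit time:

```go
package main

import (
	"fmt"
	"sync"
)

// cluster is a toy model: each node's resources are reduced to one integer.
type cluster struct {
	mu   sync.Mutex
	free map[string]int // node name -> remaining capacity
}

// commit is the only synchronized step: if the chosen node still has room,
// the pod is bound; otherwise the caller must re-schedule the pod.
func (c *cluster) commit(node string, request int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.free[node] < request {
		return false // conflict: another pod got there first
	}
	c.free[node] -= request
	return true
}

func main() {
	c := &cluster{free: map[string]int{"node-a": 1, "node-b": 1}}

	// Both pods were scored in parallel and (optimistically) chose node-a.
	proposals := map[string]string{"pod-1": "node-a", "pod-2": "node-a"}

	var wg sync.WaitGroup
	for pod, node := range proposals {
		wg.Add(1)
		go func(pod, node string) {
			defer wg.Done()
			if c.commit(node, 1) {
				fmt.Printf("%s bound to %s\n", pod, node)
			} else {
				fmt.Printf("%s lost the race for %s, re-scheduling\n", pod, node)
			}
		}(pod, node)
	}
	wg.Wait()
}
```

One of the two pods always wins and the other is simply retried; the bet is that in a large cluster the retries are rare enough that the parallelism more than pays for them.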
So here’s my initial architecture idea for the distributed scheduler:

[Figure: scatter-gather as a Kubernetes pod scheduler pattern]

The Relay starts a watch on unscheduled pods against the kube-apiserver. As pods stream in, the Relay forwards them to the different Schedulers. Each Scheduler is responsible for filtering and scoring against its own subset of the overall nodes, then sends back to the Relay its top-winning node and score. The Relay aggregates these scores, picks the overall winner, and sends a true/false back to each Scheduler; that true/false dictates whether the Scheduler should actually bind the pod to its proposed node.

The Scheduler here is thus a slightly modified version of the upstream kube-scheduler. It has a custom gRPC server endpoint for receiving the new pod. It has custom code to know which of the overall nodes it is responsible for. And it has a custom Permit extension point for sending the proposed node back to the Relay. The Permit extension point runs after the nodes are filtered and scored, but before the pod is bound; a Permit plugin approves or rejects scheduling the pod on the specified node.

This is the basic design, and it works pretty well. It doesn’t quite work up to 1M-node scale (we’ll talk about that next), but it delivers a much more scalable scheduler than what exists today, while preserving all of the nuanced, complex, battle-tested logic of the current system.

Today’s scheduler is effectively O(n × p), where n is the number of nodes and p is the number of pods. That complexity becomes untenable as n grows. The sharded approach counteracts the scaling problem: with n nodes and r scheduler replicas, each replica only evaluates n/r nodes per pod, turning that large factor back into something tractable. For example, with 1 million nodes and 100 shards, each scheduler considers only 10,000 nodes per pod.
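For reference, here is a rough sketch of what that Permit hook could look like against the upstream scheduler framework (assuming a recent version of the framework package). It is not the actual implementation from this project: RelayClient and its Approve method are hypothetical stand-ins for the gRPC call back to the Relay, and the shard’s score is elided.

```go
package relaypermit

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// RelayClient is a hypothetical client for the Relay's gRPC endpoint.
// Approve reports this shard's proposed node for the pod and returns true
// if the Relay picked this shard's proposal as the overall winner.
type RelayClient interface {
	Approve(ctx context.Context, podUID, nodeName string) (bool, error)
}

// RelayPermit is a Permit plugin: it runs after filtering and scoring,
// but before the pod is bound.
type RelayPermit struct {
	relay RelayClient
}

var _ framework.PermitPlugin = &RelayPermit{}

func (p *RelayPermit) Name() string { return "RelayPermit" }

func (p *RelayPermit) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	ok, err := p.relay.Approve(ctx, string(pod.UID), nodeName)
	if err != nil {
		return framework.AsStatus(err), 0
	}
	if !ok {
		// Another shard won; this pod goes back through the scheduling queue.
		return framework.NewStatus(framework.Unschedulable, "relay chose a node on another shard"), 0
	}
	return framework.NewStatus(framework.Success), 0
}
```

In the design described above, the plugin would also forward this shard’s score so the Relay can compare it against the other shards’ proposals before replying.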