What 100k concurrent sandboxes have taught us so far

tl;dr We have moved the Scale Invitational to June 17th after underestimating the complexity of orchestrating and measuring 100,000 sandboxes concurrently.

The Scale Invitational was meant to answer a simple question: “when an app needs to spin up tens of thousands of sandboxes at once, how do providers hold up?”

Our daily benchmark already measures cold-start time, staggered ramp, and small-scale burst behavior across providers. But “small-scale” is the operative word — a few hundred sandboxes from a single runner tells you almost nothing about what happens at scale. So we decided to run a 100,000-sandbox test.

What we did not expect was how much we’d learn before ever running the test.

This post is a candid look at the problems we ran into, the course-corrections, and why we’re deliberately taking our time before publishing a single 100k number.

v1: 10,000 iterations from one very busy VM

The first version was the obvious one. Take the benchmark we already trust, crank up the iteration count, and point it at a single beefy VM. We pushed it up toward 10,000 sandbox creations from one machine and watched it work.

It worked — right up until it didn’t tell us anything useful.

A single VM has a single network stack, a single event loop, and a single egress IP. Long before you reach interesting provider behavior, you start measuring your own machine’s limits. The numbers we got back were as much a portrait of our test rig as of any provider.
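One way to notice when a single-machine rig is measuring itself is to sample event-loop lag alongside the creation latencies. The sketch below is only an illustration under stated assumptions, not the harness used for the benchmark: `fake_create_sandbox` is a hypothetical stand-in that sleeps instead of calling a provider, and `sample_loop_lag` and `run_burst` are names invented for this example.

```python
import asyncio
import time


async def fake_create_sandbox(provider_latency_s: float) -> float:
    """Hypothetical stand-in for a provider create call; returns observed latency."""
    start = time.perf_counter()
    await asyncio.sleep(provider_latency_s)  # pretend this is the network round trip
    return time.perf_counter() - start


async def sample_loop_lag(interval_s: float, stop: asyncio.Event, samples: list) -> None:
    """Record how late the event loop wakes this task relative to the requested interval."""
    while not stop.is_set():
        start = time.perf_counter()
        await asyncio.sleep(interval_s)
        overshoot = time.perf_counter() - start - interval_s
        samples.append(max(0.0, overshoot))


async def run_burst(n: int, provider_latency_s: float = 0.05) -> dict:
    """Launch n simulated creations at once and report latency next to loop lag."""
    stop = asyncio.Event()
    lag_samples: list = []
    sampler = asyncio.create_task(sample_loop_lag(0.01, stop, lag_samples))

    latencies = await asyncio.gather(
        *(fake_create_sandbox(provider_latency_s) for _ in range(n))
    )

    stop.set()
    await sampler  # let the sampler finish its current interval and exit
    return {
        "created": len(latencies),
        "max_observed_latency_s": max(latencies),
        "max_loop_lag_s": max(lag_samples, default=0.0),
    }
```

Calling `asyncio.run(run_burst(n))` with increasing `n` gives a rough signal: the simulated provider latency is constant, so any growth in observed latency or loop lag comes from the client machine, not the provider.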
