
Deterministic Simulation Testing in Rust: A Theater of State Machines


It's been just a year since we wrote about how we implemented deterministic simulation testing (DST) of our Go database (FrostDB). Since then, we have been hard at work writing our new database, backed by object storage, in Rust (read more here about why). In this blog post, I want to lay out how we built our new Rust database with DST principles front and center, and how that approach compares to the one we took with FrostDB.

As a quick recap, DST tests are randomized full system integration tests where any failure can be reproduced using the same initial random seed. DST helps you find complex bugs in your system before customers hit them in production. Since failures can be deterministically reproduced, it significantly reduces the time needed to replicate hard-to-find bugs and streamlines the debugging process so that more time can be spent on feature work. The confidence this provides in system correctness is transformative: developers can ship complex features at full speed, knowing that DST will catch any subtle regressions before they reach production. For more on DST principles and benefits, see our previous blog post.

The Four Ingredients of DST

In essence, DST boils down to controlling four main ingredients: concurrency, time, randomness, and failure injection. The first three are non-negotiable: you must have full control over task scheduling while injecting replayable sources of time and randomness. These three alone are enough for deterministic execution; failure injection is optional. With zero failure injection you still get a deterministic, reproducible integration test of a random execution schedule of your system, which is already valuable. The real power of DST, however, comes when you turn up the failure injection knob.
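To make the first three ingredients concrete, here is a minimal sketch (all names are hypothetical, and this is not our actual implementation) of a simulation environment that owns a seeded PRNG, a simulated clock, and a task queue drained in PRNG-chosen order. Rerunning with the same seed replays the identical schedule.

```rust
// Hypothetical sketch: one struct owning the three non-negotiable
// ingredients — replayable randomness, simulated time, and scheduling.
struct SimEnv {
    rng_state: u64,           // seeded randomness: the seed is the whole test input
    now_ms: u64,              // simulated time: advances only when the simulation says so
    tasks: Vec<&'static str>, // task labels standing in for real runnable tasks
}

impl SimEnv {
    fn new(seed: u64, tasks: Vec<&'static str>) -> Self {
        Self { rng_state: seed.max(1), now_ms: 0, tasks }
    }

    // xorshift64 for illustration: any PRNG works, as long as it is
    // seeded and replayable.
    fn next_u64(&mut self) -> u64 {
        self.rng_state ^= self.rng_state << 13;
        self.rng_state ^= self.rng_state >> 7;
        self.rng_state ^= self.rng_state << 17;
        self.rng_state
    }

    // Deterministic scheduler: the interleaving is random, but fully
    // reproducible from the seed. Returns the (time, task) order ran.
    fn run(&mut self) -> Vec<(u64, &'static str)> {
        let mut order = Vec::new();
        while !self.tasks.is_empty() {
            let i = (self.next_u64() as usize) % self.tasks.len();
            self.now_ms += 1; // time is a simulation output, not the wall clock
            order.push((self.now_ms, self.tasks.remove(i)));
        }
        order
    }
}

fn main() {
    let mut a = SimEnv::new(42, vec!["flush", "compact", "query"]);
    let mut b = SimEnv::new(42, vec!["flush", "compact", "query"]);
    // Same seed, same schedule — every time.
    assert_eq!(a.run(), b.run());
    println!("schedules match");
}
```

The key design point is that nothing in the environment reads the OS clock or OS entropy: every "random" decision and every tick of time flows from the single seed, which is what makes any failing run replayable.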

This amount of control requires you to think carefully about how your code is designed and run. When we implemented DST in FrostDB, we had to achieve this level of control in an existing codebase. In Go, the default concurrency model multiplexes goroutines onto a set of OS threads via the language runtime, ceding scheduling control to not one but two external actors. Our choices were therefore to either rewrite an existing, complex codebase around a scheduler of our own, or to find a way to control the existing scheduler by forcing single-threaded execution and making the runtime's scheduling decisions deterministic.

We chose the latter approach and, with some small changes, gained control of time and randomness. This was effective and surfaced a good number of issues, including data loss and data duplication bugs. However, we were never really satisfied with the amount of failure injection we could achieve this way: every type of failure required a specific interface implementation (e.g. a VFS), and any scheduler-related failure injection (e.g. randomized execution schedules) would have required modifying the Go runtime scheduler more than we wanted to.

We could have gone down the same path with our Rust database. The folks at RisingWave have helpfully written a deterministic futures executor called madsim, a drop-in replacement for tokio. Madsim offers deterministic execution of futures given a random seed, and also helpfully implements failure injection for some interfaces.

For an existing Rust codebase adopting DST, madsim is probably a good choice. However, writing a new codebase from scratch gave us a golden opportunity to take full ownership of concurrency, time, randomness, and failure injection. So we decided to go all in.

Enter State Machines

Confident in our decision, we chose to architect our new database as a set of state machines, heavily inspired by the actor-based concurrency model outlined in the sled simulation guide. The idea is that all of your core components are written as single-threaded state machines. In production, these state machines are wrapped by a “driver” that routes messages to other state machines via the network and handles local state machine outputs. In testing, a single thread drives an event loop and routes messages between state machines using a message bus.
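The shape described above can be sketched roughly as follows (hypothetical trait, message types, and routing scheme — not our actual code): each component implements a single-threaded state machine whose step function consumes one input message and returns addressed output messages, and the test harness drains a message bus on a single thread.

```rust
use std::collections::VecDeque;

// Hypothetical message type; a real system would have per-component enums.
#[derive(Debug, Clone, PartialEq)]
enum Msg {
    Ping(u32),
    Pong(u32),
}

// Each core component is a single-threaded state machine: one input
// message in, zero or more (destination, message) outputs out.
trait StateMachine {
    fn step(&mut self, msg: Msg) -> Vec<(usize, Msg)>;
}

// Toy machine 0: keeps pinging machine 1 until it has seen Pong(3).
struct Pinger {
    done: bool,
}
impl StateMachine for Pinger {
    fn step(&mut self, msg: Msg) -> Vec<(usize, Msg)> {
        match msg {
            Msg::Pong(n) if n >= 3 => {
                self.done = true;
                vec![]
            }
            Msg::Pong(n) => vec![(1, Msg::Ping(n + 1))],
            _ => vec![],
        }
    }
}

// Toy machine 1: echoes every Ping back as a Pong.
struct Ponger;
impl StateMachine for Ponger {
    fn step(&mut self, msg: Msg) -> Vec<(usize, Msg)> {
        match msg {
            Msg::Ping(n) => vec![(0, Msg::Pong(n))],
            _ => vec![],
        }
    }
}

// Test-mode driver: a single thread drains the bus and routes outputs
// back onto it. In production, a driver would instead route messages
// over the network and handle local outputs. Returns messages delivered.
fn run_bus(machines: &mut [&mut dyn StateMachine], mut bus: VecDeque<(usize, Msg)>) -> usize {
    let mut delivered = 0;
    while let Some((dst, msg)) = bus.pop_front() {
        delivered += 1;
        for out in machines[dst].step(msg) {
            bus.push_back(out);
        }
    }
    delivered
}

fn main() {
    let mut pinger = Pinger { done: false };
    let mut ponger = Ponger;
    let mut machines: [&mut dyn StateMachine; 2] = [&mut pinger, &mut ponger];
    let delivered = run_bus(&mut machines, VecDeque::from([(1usize, Msg::Ping(0))]));
    assert!(pinger.done);
    println!("delivered {delivered} messages");
}
```

Because the machines never touch threads, clocks, or sockets themselves, the harness is free to reorder, delay, duplicate, or drop bus messages under a seeded PRNG, which is exactly where the failure injection knob attaches.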
