An Update on TinyKVM

An update on TinyKVM fwsGonzo 8 min read · 1 hour ago 1 hour ago -- Listen Share

Hey all. TinyKVM was open-sourced this february and since then I’ve been working on some things that are very much outside of the scope of the original implementation. Originally, it was intended to be for pure computation (and that is very much still possible, and is the default), but makes it hard to use TinyKVM outside of specialized use-cases. So, I’ve relented and implemented limited support for running unmodified executables in TinyKVM. Specifically, run-times like Deno, Python WSGI and similar run-times like Lune.

I would like to make a special shout-out to Laurence Rowe who championed KVM server, which has now become almost a de-facto CLI for TinyKVM servers. It’s very much work in progress, but give it a try if you’re interested in these kinds of things.

In order to achieve this I picked the very untraditional route of implementing system call emulation, but as poorly as possible. And as few system calls as possible. I think today there is 50 real system calls (gVisor has ~200 for comparison), and all of them will to some degree make shit up (for lack of a better term). The goal is to avoid accessing the (shared) Linux kernel when at all possible, but give sanitized access when permitted and appropriate. To give an example of what I mean by this: The only allowed ioctl operations are setting and getting non-blocking mode (FIONBIO), and reading the number of available bytes (FIONREAD). This minimalist system call API is currently able to run quite a few complex run-times unmodified. Programs are surprisingly good at handling failing system calls, or suspicious return values. If you put a jailer on top it should be good enough for production, but I do still recommend to use TinyKVM in pure compute mode. Something like Jailer + TinyKVM + Deno + per request isolation.

Per-Request Isolation

Per-request isolation is apparently not that common. I could not find any other production-level support other than in wasmtime (and previously Lucet). But, due to its lack of in-guest JIT support it will not be able to compete with Deno so I will just focus on the positives: It uses a clever lazy MADV_DONTNEED mechanism which delays the cost. You can go test wasmtime’s per-request isolation right now with the hello-wasi-http example.

In TinyKVM there are two reset modes, which together forms hybrid per-request isolation that is capable of maintaining a low memory footprint. Together, it makes the fastest per-request isolation that exists right now. It’s main mode will directly rewrite all touched pages in a VM fork back to their original contents and then leave pagetables (and TLBs) intact. This mode has turned out to be the fastest, but as it leaves the memory footprint untouched it can only grow memory usage for forked VMs. Forked VMs are tiny to begin with, but for large page rendering work it may be a concern, hence there’s a second mode triggered by a fork using memory above a limit. The second mode resets the entire VM with pagetables and everything, which happens when it uses working memory above a certain limit or an exception occurs during request handling. It’s not particularly expensive on its own, but if every VM fork would do it all the time the IPIs and coherency chatter would be a bottleneck.

Press enter or click to view image in full size TinyKVM is able to beat native performance due to avoiding GC

So we ran a full Deno page rendering benchmark in TinyKVM and then also the very same (unmodified) benchmark natively. We made GC single-threaded in order to compare equally. It would normally run async in another thread, but you’d still have to pay the cost of doing it. What we found was that TinyKVM generally had lower p90+ latency, while native had better p50.

Now this is incredible. Per-request isolation is very very expensive. We are resetting an entire KVM VM every request back to its original state. And we’re doing it very close to native not doing it at all. We’re doing it with unmodified Deno, a big run-time, and with a full page rendering benchmark, a large piece of compute work that builds real memory pressure.

... continue reading