Everything I know about good system design

I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are new to the industry. Another is the Twitter-optimized “you’re a terrible engineer if you ever store booleans in a database” clever trick. Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.

What is system design? In my view, if software design is how you assemble lines of code, system design is how you assemble services. The primitives of software design are variables, functions, classes, and so on. The primitives of system design are app servers, databases, caches, queues, event buses, proxies, and so on.

This post is my attempt to write down, in broad strokes, everything I know about good system design. A lot of the concrete judgment calls do come down to experience, which I can’t convey in this post. But I’m trying to write down what I can.

Recognizing good design

What does good system design look like? I’ve written before that it looks underwhelming. In practice, it looks like nothing going wrong for a long time. You can tell that you’re in the presence of good design if you have thoughts like “huh, this ended up being easier than I expected”, or “I never have to think about this part of the system, it’s fine”. Paradoxically, good design is self-effacing: bad design is often more impressive than good. I’m always suspicious of impressive-looking systems. If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).

I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design. I say “usually” because sometimes you do need complex systems. I’ve worked on many systems that earned their complexity. However, a complex system that works always evolves from a simple system that works. Beginning from scratch with a complex system is a really bad idea.

State and statelessness

The hard part about software design is state. If you’re storing any kind of information for any amount of time, you have a lot of tricky decisions to make about how you save, store and serve it. If you’re not storing information, your app is “stateless”. As a non-trivial example, GitHub has an internal API that takes a PDF file and returns a HTML rendering of it. That’s a real stateless service. Anything that writes to a database is stateful.

You should try and minimize the amount of stateful components in any system. (In a sense this is trivially true, because you should try to minimize the amount of all components in a system, but stateful components are particularly dangerous.) The reason you should do this is that stateful components can get into a bad state. Our stateless PDF-rendering service will safely run forever, as long as you’re doing broadly sensible things: e.g. running it in a restartable container so that if anything goes wrong it can be automatically killed and restored to working order. A stateful service can’t be automatically repaired like this. If your database gets a bad entry in it (for instance, an entry with a format that triggers a crash in your application), you have to manually go in and fix it up. If your database runs out of room, you have to figure out some way to prune unneeded data or expand it.

What this means in practice is having one service that knows about the state - i.e. it talks to a database - and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service. If you can, it’s worth doing this for the read logic as well, although I’m less absolutist about this. It’s sometimes better for services to do a quick read of the user_sessions table than to make a 2x slower HTTP request to an internal sessions service.

... continue reading