Replacing a Cache Service with a Database

I’ve been thinking about this: will we ever replace caches entirely with databases? In this post I will share some ideas and how we are moving towards it. tl;dr: we are not there yet.

Why do we even use caches?

Caches solve one important problem: serving pre-computed data at far lower latency than a database can. I am talking about the typical setup where a cache sits alongside the db (the cache-aside pattern): the application talks to both the cache and the database, and tries to keep the cache up to date with the db. There are other patterns where the cache itself talks to the database (read-through or write-through, for example), but I think cache-aside is the more common one.
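
To make the cache-aside pattern concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the db, a plain dict stands in for the cache service, and the `users` table and the `get_user_name` / `rename_user` helpers are hypothetical names, not anything from a real system.

```python
import sqlite3

# Stand-ins (assumptions for illustration): an in-memory SQLite database
# plays the db, and a plain dict plays the external cache service.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'ada')")
db.commit()

cache = {}


def get_user_name(user_id):
    """Read path: try the cache first, fall back to the db on a miss."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    rows = db.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchall()
    name = rows[0][0] if rows else None
    if name is not None:
        cache[key] = name  # populate the cache for later reads
    return name


def rename_user(user_id, new_name):
    """Write path: update the db, then invalidate the cached entry."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    db.commit()
    cache.pop(f"user:{user_id}", None)
```

Note that the application owns both halves: it decides what goes into the cache and when an entry has to be thrown away.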

I like to keep my systems simple and reduce dependencies where possible. If a database can provide the same benefits as a cache, we can go a long way before we need to add an external caching service.

Instead of using a cache like Valkey (or Redis), you could just set up a read replica and use it like a cache. Databases already keep some data in memory (in the buffer pool). Caches aren’t expected to be strongly consistent with the DB, and neither are read replicas. As an added benefit, you can use the same SQL queries instead of whatever custom interface the cache provides. Not using a cache would make things operationally so much simpler, and I’d never have to worry about cache invalidation.
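
Here is a rough sketch of what reading from a replica looks like from the application's side. Everything in it is an assumption for illustration: two in-memory SQLite databases stand in for a primary and a read replica, and Python's `sqlite3.Connection.backup` is a crude substitute for real replication. The `products` table, `sync_replica`, and `read_price` are hypothetical names.

```python
import sqlite3

# Stand-ins (assumptions): two in-memory SQLite databases play the primary
# and the read replica; a full copy plays the role of replication.
primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")

primary.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER)")
primary.execute("INSERT INTO products (id, price) VALUES (1, 100)")
primary.commit()

# The same SQL works against the primary and the replica.
QUERY = "SELECT price FROM products WHERE id = ?"


def sync_replica():
    """Copy the primary's current state into the replica."""
    primary.backup(replica)


def read_price(conn, product_id):
    """Run the shared query against whichever connection is passed in."""
    rows = conn.execute(QUERY, (product_id,)).fetchall()
    return rows[0][0] if rows else None


sync_replica()
primary.execute("UPDATE products SET price = 120 WHERE id = 1")
primary.commit()

stale = read_price(replica, 1)  # replica has not seen the update yet
sync_replica()
fresh = read_price(replica, 1)  # replica has caught up
```

There is no key naming and no explicit invalidation here, but reads can lag behind writes until replication catches up, much like a cache that has not been refreshed yet.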

If you use an embedded database (like SQLite or PGlite) with replication (like Litestream or libSQL), you’d even get zero network latency on reads, since the data lives inside the application’s own process.

However, caches are still very prominent and can’t be replaced with just read replicas. I often think about how we can bridge the gap, but I think the workloads are so different that it’s not going to happen anytime soon. The closest we’ve come, I think, is Noria + MySQL (now ReadySet).

So why are caches still preferred? Here are a few things caches do better than databases:

- Setting up and destroying a cache is cheap; both operationally and cost-wise.
- Most workloads only cache a subset of the data, and developers have control over what that subset is. It uses fewer resources. With a DB + buffer pool, that level of control doesn’t exist today.
- Caches keep pre-computed data. I could do a complex join and then save the results in a cache. How could I achieve the same with a db?
- I don’t know of any database that lets me assign priority to specific rows to always keep them in the buffer pool. Caches also provide eviction policies (and TTL), which I can’t do with the DB buffer pool.
- Databases are orders of magnitude larger than caches. Using a full read replica that consumes terabytes of storage just to access a few gigabytes of hot data feels wasteful. Some cloud providers won’t even let you use larger SSDs without also upgrading CPU/memory.
- Cache services can handle hundreds of thousands of concurrent connections, whereas databases generally don’t scale that way. Database connections are expensive.
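
To illustrate the kind of control mentioned above (choosing what to keep, a TTL per entry, and an eviction policy), here is a small in-process sketch. It is not how Valkey or Redis implement eviction; `TTLCache`, `max_entries`, and `clock` are hypothetical names, and the cached value is just a made-up result of an expensive join.

```python
import time
from collections import OrderedDict


class TTLCache:
    """A tiny LRU cache with a per-entry TTL (illustrative only)."""

    def __init__(self, max_entries, clock=time.monotonic):
        self.max_entries = max_entries
        self.clock = clock  # injectable so expiry can be tested
        self._data = OrderedDict()  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (self.clock() + ttl_seconds, value)
        self._data.move_to_end(key)  # mark as most recently used
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # a hit refreshes recency
        return value


# Usage: keep the result of an expensive join around for 60 seconds.
cache = TTLCache(max_entries=1000)
cache.set("report:top_customers", [("ada", 3), ("grace", 2)], ttl_seconds=60)
top_customers = cache.get("report:top_customers")
```

The application decides which results are worth keeping, for how long, and what gets dropped first when space runs out. I don't know of a way to get that per-entry control from a buffer pool.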
