
Thundering herd problem: Preventing the stampede


A couple of years back I wrote a post describing the thundering herd problem. Looking back, I had only cursory knowledge of the problem. I knew what it was & how it could impact an application. I knew a possible solution that could be used to solve it. But these were individual pieces of information. Nothing connected them into an end-to-end view, primarily because I had seen neither the problem nor the solution in action. I knew the fundamental building blocks, but that was it.

So in this post I will take another stab at this problem area, but this time I will couple it with an actual implementation. We will reproduce the thundering herd problem & then take a look at a couple of solutions that can be used to solve it. Let's dive in.

Understanding the thundering herd problem

A typical solution that pops up whenever an application experiences increased query load on a database is to place a cache between the web server & the database. The intention is that we first look up the cache, & if there is an existing entry we return it without querying the database. If there is a cache miss, we look up the database & backfill the cache so the entry can serve future requests. This works because the cache ends up serving the hot keys which are frequently queried, keeping traffic on the database low. Sweet & simple.

The issue begins when we have a large number of concurrent requests for the same record. All the requests query the cache at the same time & all of them see a cache miss. As a result, every request ends up querying the same record from the database, increasing the overall load on it. So if your application starts experiencing lookup patterns that comprise multiple such hot keys, your original solution of using a cache to decrease load on the database no longer works.

Reproducing the thundering herd problem

Let's take a look at an application which demonstrates this problem. We have a simple backend application built using Spring Boot that uses a Postgres database to store records & Redis to cache the values. Our application follows the cache-aside pattern described above. I have also added Zipkin tracing to verify the thundering herd problem. Let's look at the service code that results in the thundering herd problem (note that I have cleaned up the code to remove the tracing instrumentation; you can view the complete code for this demo on this Github link).

public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;

    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);

    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }

    // If the product is not in cache, fetch it from DB
    Product product = productRepository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id));

    // Backfill the cache
    redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
    return product;
}

The above code results in the thundering herd problem when we get concurrent requests for the same key. In order to test it, I have used a small Golang program to replicate the concurrent request scenario. Let's view a demo of this:

In the above demo, you can clearly see that for every request querying the same product id, we see 3 separate traces:
