Metastable Failures and Interactions Between Systems

I’ve written about metastable failures before. The topic has been picked up by a few different teams since the, all analyzing metastable failures more, while I apparently has been slacking off… Anyway, Metastable failures are self-sustaining performance failures that arise in systems due to a positive feedback loop triggered by an initial problem. This positive feedback loop, or as I sometimes call it, a sustaining effect, is the defining characteristic of the metastable failure pattern. If we can somehow stop the loop, we stop the self-sustaining part, making recovery from the initial problem much easier.

Actions, States, and Signals

To better understand the problem, we need to examine the feedback loops. A typical metastable failure example is a retry storm: a serving system is overloaded due to an initial problem, the overload leads to high latency and timeouts for some requests, and clients retry those timed-out requests, creating even more load, higher latency, and even more timeouts and retries. In this scenario, we have two systems that interact with each other: clients and the serving system.

The two systems are in a control loop that has positive feedback. Under normal conditions, a client acts by sending some requests. These client actions change the state of the serving system, as it now has work to process. This state change produces a signal for the serving system to act by receiving and processing requests. Receiving and processing requests also changes the state of the serving system in many ways. One that interests us is the load on the serving system. The load state produces signals that clients can observe: request latency and request timeouts.

Here is the crux of the retry problem, though: we do not want to retry when the serving system is overloaded. We want to retry when there is an intermittent failure, such as a message loss, a server failover on the serving system, or any other short-lived failure. These failures, however, send the same signal to clients, as the overload state: request timeouts.

Going back to metastable failure, when the serving system slows down, the load increases, and this state produces request-timeout signals, the clients act on those signals and retry some work. It is this action performed erroneously on an ambiguous signal that causes the serving system to receive more work, become even more loaded, and produce a stronger request-timeout signal, completing the loop.

In short, we can explain interactions as follows: systems or components observe signals and act on them by interacting with other systems/components. These interactions change the state of the components, which emit new signals that can be acted upon. While signals are proxies for some state of a system or component, they are often ambiguous — different states may produce the same signal, and different systems or components may produce the same signal.

Avoiding Metastable Failures?

In the intuitive model described above, there are a few ways to avoid metastable failures.

Avoid interactions between components. These interactions provide input/output to/from systems, and “interesting” programs tend to both need input and produce output, so altogether avoiding them is not an option. However, minimizing unnecessary interactions is a good mitigation strategy.

... continue reading