Tech News
← Back to articles

Jepsen: NATS 2.12.1

read original related products more articles

1 Background

NATS is a popular streaming system. Producers publish messages to streams, and consumers subscribe to those streams, fetching messages from them. Regular NATS streams are allowed to drop messages. However, NATS has a subsystem called JetStream, which uses the Raft consensus algorithm to replicate data among nodes. JetStream promises “at least once” delivery: messages may be duplicated, but acknowledged messages should not be lost. Moreover, JetStream streams are totally ordered logs.

JetStream is intended to “self-heal and always be available”. The documentation also states that “the formal consistency model of NATS JetStream is Linearizable”. At most one of these claims can be true: the CAP theorem tells us that Linearizable systems can not be totally available. In practice, they tend to be available so long as a majority of nodes are non-faulty and communicating. If, say, a single node loses network connectivity, operations must fail on that node. If three out of five nodes crash, all operations must fail.

Indeed, a later section of the JetStream docs acknowledges this fact, saying that streams with three replicas can tolerate the loss of one server, and those with five can tolerate the simultaneous loss of two.

Replicas=5 - Can tolerate simultaneous loss of two servers servicing the stream. Mitigates risk at the expense of performance.

When does NATS guarantee a message will be durable? The JetStream developer docs say that once a JetStream client’s publish request is acknowledged by the server, that message has “been successfully persisted”. The clustering configuration documentation says:

In order to ensure data consistency across complete restarts, a quorum of servers is required. A quorum is ½ cluster size + 1. This is the minimum number of nodes to ensure at least one node has the most recent data and state after a catastrophic failure. So for a cluster size of 3, you’ll need at least two JetStream enabled NATS servers available to store new messages. For a cluster size of 5, you’ll need at least 3 NATS servers, and so forth.

With these guarantees in mind, we set out to test NATS JetStream behavior under a variety of simulated faults.

2 Test Design

We designed a test suite for NATS JetStream using the Jepsen testing library, using JNATS (the official Java client) at version 2.24.0. Most of our tests ran in Debian 12 containers under LXC; some tests ran in Antithesis, using the official NATS Docker images. In all our tests we created a single JetStream stream with a target replication factor of five. Per NATS’ recommendations, our clusters generally contained three or five nodes. We tested a variety of versions, but the bulk of this work focused on NATS 2.12.1.

... continue reading