Bypassing the Branch Predictor

A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.

Let’s work with something relatively simple & concrete: consider that we want to write some kind of financial system (maybe a trading system) and all of our transaction requests arrive at a certain function before being either (a) sent out to some authoritative server, or (b) abandoned. Let’s also assume that:

The vast majority of our transaction requests end up being abandoned at the last step. We care A LOT about the speed of the ‘send’ path and we want to make it as fast as possible. We don’t care at all about the speed of the ‘abandon’ path.

The code would look roughly like this:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 /* Defined somewhere else */ struct Transaction ; bool should_send ( Transaction * t ); void send ( Transaction * t ); void abandon ( Transaction * t ); void resolve ( Transaction * t ) { if ( should_send ( t )) { send ( t ); } else { abandon ( t ); } }

The implication of assumption #1 is that the branch predictor will be heavily primed to predict that should_send(t) returns false . What this means is that when we finally get a transaction that needs to be sent out, we’ll likely have to pay a branch misprediction penalty (~20 cycles); plus, our send() function may not be in the instruction cache just yet, because it’s so rarely executed. Also, we won’t get the advantage of pipelining whatever instructions are needed for send() .

Since we only care about the speed of the send() path, I was wondering if there’s a way of telling the CPU that we don’t actually want to rely on the branch predictor at all when executing send() or abandon() : we always want to assume that send() will be executed.

A low-level solution?

I asked Claude if there is such a way to basically hard-code branch prediction rules into the machine code, and the answer was that there’s no way to do this on x86, but there is a way on ARM: the BEQP (predict branch taken) and BEQNP (predict branch not taken) instructions.

Those ARM instructions are just hallucinated, and the reality is actually the other way around: ARM doesn’t have a way of hard-coding ‘predictions’, but x86 does. More accurately: some old x86 processors do. On the Pentium 4 series, those rules/hints are encoded as instruction prefixes:

... continue reading