Branchless Quicksort faster than std::sort and pdqsort with C and C++ API

Fast Branchless Quicksort using Sorting Networks with C and C++ Interface

Performance results naturally depend on the underlying hardware. The following benchmarks show the execution times for sorting 50 million doubles using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an AMD Ryzen 3 Linux system using GCC, both compiled with the -O3 option.

Implementation              Apple M1    AMD Ryzen
std::sort                   1.33s       5.56s
pdqsort                     1.33s       2.81s
blqsort (single threaded)   0.97s       2.06s

For a fair comparison, the single-threaded version of blqsort was used here. On an M1, the threaded versions are faster by a further factor of 3 to 4. In terms of runtime, the C++ versions differ only slightly from the C version.

blqsort

Full source code is included on GitHub. There are four implementations of blqsort here, each provided as a single header file.

Branchless programming

On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

for (int i = 0; i < 1000; i++) {
    small_numbers[smlen] = numbers[i];
    smlen += (numbers[i] < 500);
}

is typically faster on unpredictable data than this version with a branch:

for (int i = 0; i < 1000; i++) {
    if (numbers[i] < 500) {
        small_numbers[smlen] = numbers[i];
        smlen += 1;
    }
}
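To make the comparison easier to try out, the two loops above can be wrapped as functions. This is only a minimal sketch: the function names, the `limit` parameter and the use of `size_t` are assumptions made for illustration and are not part of blqsort. Note that the branchless variant writes to the destination on every iteration, so the destination buffer must have room for all `n` input elements, even though only the first `len` entries are meaningful afterwards.

```c
#include <stddef.h>

/* Hypothetical helper: version with a branch.
 * The CPU has to predict the outcome of every comparison. */
size_t filter_below_branchy(const int *src, size_t n, int limit, int *dst)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] < limit) {
            dst[len] = src[i];
            len += 1;
        }
    }
    return len;
}

/* Hypothetical helper: branchless version.
 * Every element is stored unconditionally; the write position only
 * advances (by 1) when the comparison is true, so rejected values are
 * overwritten by the next store. dst needs capacity for n elements. */
size_t filter_below_branchless(const int *src, size_t n, int limit, int *dst)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        dst[len] = src[i];
        len += (size_t)(src[i] < limit); /* comparison yields 0 or 1 */
    }
    return len;
}
```

Both functions return the number of elements kept and fill `dst` with the same values in the same order. Whether the branchless form is actually faster depends on the data and on the compiler, which may already turn the branching loop into conditional moves, so it is worth measuring on the target machine.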
