Faster sorting with SIMD CUDA intrinsics (2024)

Published on: 2025-07-28 23:45:09

Full code on GitHub: https://github.com/wiwa/blog-code/

Hi

Recently, I finished a batch at the Recurse Center… is what I would have said if this post had been written when I intended to write it (i.e. 3 months ago). My project there focused on a questionable application of CUDA (mostly irrelevant to this post), but it got me thinking more about other GPU-friendly algorithms. Instead of my Recurse project (which I hope to write about in a later post), I want to simply begin writing about technical stuff I’ve played around with.

Today’s post is a high-level overview of a particular kind of parallel sorting algorithm called bitonic sort. I’ll go over the context behind the algorithm, a few basics of SIMD programming, a CUDA implementation, and how a small optimization gives it a 30% performance uplift. We’ll be using a SIMD-like operation (a CUDA instruction called __shfl_sync) to quickly sort 32-element “vectors”.

What is a Bitonic Sort?

But wh ...
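To make the overview's idea of sorting a 32-element "vector" with lane-to-lane exchanges more concrete, here is a minimal host-side sketch in plain C++. It is not the code from the linked repository. It models one warp as a 32-element array and emulates the exchange step with array indexing. The names bitonic_sort_warp, Warp, and kWarpSize are made up for this illustration, and the XOR-partner pattern is the textbook bitonic network, assumed here rather than taken from the post. On a GPU, the marked line would typically be a shuffle such as __shfl_xor_sync(0xffffffff, mine, j).

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical names for illustration only.
constexpr int kWarpSize = 32;

// Host-side model of one warp: element i is the value held by lane i.
using Warp = std::array<std::int32_t, kWarpSize>;

// Sorts the 32 values in ascending order using a bitonic sorting network.
Warp bitonic_sort_warp(Warp v) {
    // k is the size of the bitonic sequences being merged in this stage.
    for (int k = 2; k <= kWarpSize; k <<= 1) {
        // j is the distance between the two lanes compared in this step.
        for (int j = k >> 1; j > 0; j >>= 1) {
            Warp next = v;  // all lanes read old values, then write new ones
            for (int lane = 0; lane < kWarpSize; ++lane) {
                const std::int32_t mine = v[lane];
                // Emulated exchange. On the GPU this would be roughly:
                //   other = __shfl_xor_sync(0xffffffff, mine, j);
                const std::int32_t other = v[lane ^ j];
                // Blocks of size k alternate between ascending and descending.
                const bool ascending = (lane & k) == 0;
                // The lower lane of a pair keeps the min in an ascending block.
                const bool is_lower = (lane & j) == 0;
                next[lane] = (ascending == is_lower) ? std::min(mine, other)
                                                     : std::max(mine, other);
            }
            v = next;
        }
    }
    return v;
}
```

Every lane does the same compare-and-keep work in each of the 15 steps (1 + 2 + 3 + 4 + 5), which is what makes the network a good fit for a warp: there is no data-dependent branching, only a fixed sequence of exchanges.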