What it takes to transpose a matrix

As was mentioned earlier, block algorithm becomes more efficient as word size increases. General purpose registers are only 64 bit long, so there is nothing more we can do about them. On the other hand, x86_64 carries dedicated SIMD registers and instructions with widths of 128, 256 and 512 bits depending on CPU generation. For example, with 256 bit registers we could create efficient 32×32 transposer by using log 2 (32)=5 levels of 2×2 block decomposition. Alas, working with SIMD extensions is much more challenging than issuing + or >> . A multitude of high-level instructions allows to do the same thing in different ways and with different performance results. Atop of that, availability of SIMD extensions heavily depends on CPU architecture and generation. Here we will focus on relatively modern generations of x86_64 CPUs.

First let’s do short review of what SIMD extensions x64_64 CPUs offer. The very first widely used SIMD extension for x86 CPUs was MMX ("multimedia extensions"), now already forgotten. It introduced a set of new registers which could be treated as vectors of 8/16/32-bit integers and a dedicated instruction set that allowed to perform operations on the elements of these registers in parallel. As time passed, more and more extensions were added. Newly released extensions expanded register width, increased number of registers or introduced brand new instructions. Extensions below are grouped by target vector width and are listed in historical order:

SSE family (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2): 128-bit registers, universal availability. Full set of SSE extensions provides a multitude of instructions that treat 128-bit registers as vectors of 32/64-bit floating point or 8/16/32/64 integer values. It would be hard to find a CPU without SSE support as it is already two decades old. In fact, most widely used x86_64 ABIs (application binary interfaces) mandate that floating point numbers are to be passed to functions in SSE registers rather than de-facto obsolete FPU. As such, SSE-capable CPU is mandatory for running modern applications.

AVX family (AVX, AVX2): 256-bit registers, availability is almost universal. When original AVX was introduced, it targeted only floating point numbers. AVX2 later fixed this by adding missing integer instructions. All CPUs manufactured after approximately 2015 support both original AVX and AVX2 extensions. It is perfectly safe to use all AVX instructions in server environment, however a runtime check and a separate non-AVX algorithm is advised if you target application for workstations.

AVX-512 (AVX-512F, AVX-512VL, AVX-512BW and a dozen of others) family further expands the width of the registers, now to 512 bits. However, the availability of CPUs with AVX-512 is still not universal as of 2023. Besides, the whole family of AVX-512 family was fragmented into more than a dozen of small extensions since its inception. The most common extensions are supported by all AVX-512-advertised CPUs, but support for other extensions is flappy: recently manufactured CPUs may lack some extensions supported by CPUs of previous generations. This leaves the choice of selecting which instructions are safe to use and which should be avoided to program authors.

AMX is the most recent extension relevant to our problem. It defines high-level operations on small matrices, such as matrix multiplication. Its registers ("tiles") are two dimensional, 1 KiB each and must be configured before use. Configuration specifies effective number of rows (≤16) and row size (≤64 bytes) of the tile. Unlike vector extensions, AMX acts directly on matrices and therefore is particularly promising for the problem at hand.

I will assume that we have access to AVX2 but not the newer extensions. AVX2 gives us access to 256 bit registers and integer instructions. This means that we will be able to create efficient 32x32 transposer by using log 2 (32)=5 levels of block decomposition.

As before, we will work level by level, row by row. Every row will be represented by a variable of __m256i type, which we will treat as a vector of type uint8_t[32] . This type is defined in immintrin.h header, shipped with all major compilers (intel, gcc, clang) which support AVX2. Variable of this type can be naturally mapped to a 256-bit AVX register by the compiler in the same way as uint64_t variable can be mapped to a general purpose 64-bit register. The same header also provides the intrinsics to work with these registers. Each intrinsic is mapped to respective CPU instruction, thus making it unnecessary to manually write assembler code. In our code we will rely on only three instructions: _mm256_shuffle_epi8 , _mm256_blendv_epi8 , _mm256_permute2x128_si256 . In addition, two more instructions will be used implicitly: _mm256_load_si256 and _mm256_store_si256 . They move data between memory and 256-bit registers. We do not need to call them manually since compilers can insert such instructions automatically on encountering _mm256i r = *ptr; and *ptr = r; respectively. Semantics of the other three instructions is more complicated and is provided below.

Shuffle. _mm256_shuffle_epi8 instruction accepts a source 256-bit register and a 256-bit control register, and, treating input as a vector of 8-bit values, shuffles them according to the control register, returning the result. Control register specifies for every index (0..31) of the output vector either the index of the source vector where to copy an element from, or carries a special value indicating that target location must be zeroed. Such semantics makes shuffle a very powerful instruction. If this instruction didn’t have any limitations, it would be possible to do any of the following with just single short-running instruction:

reverse the order of the elements

... continue reading