Learning About GPUs Through Measuring Memory Bandwidth
At Traverse Research, we need a deep understanding of GPU performance to develop our benchmark, Evolve. Additionally, we sometimes do projects for very specific hardware where we need to know all of its ins and outs. One way we do this is by using microbenchmarks that measure specific parts of the GPU to gain new insights. In this article, we will share what we learned from measuring the memory bandwidth of various GPUs. First we will go over some background information about GPU hardware relating to loading from and storing to memory, then we will take a look at how our microbenchmark is built, and finally we will look at the GPUs whose bandwidth we measured and what we learned from that.
Background Information
Accessing memory on a GPU is quite a bit more complicated than on a CPU. In this section we will talk about different concepts that are handy to keep in mind when programming for a GPU.
Descriptors
Memory on a GPU is usually not accessed directly via a pointer like on a CPU. Although some hardware is capable of doing this, buffer and texture accesses usually happen via a descriptor. A descriptor is nothing more than a pointer with extra metadata to support more complex logic when fetching data. For example, for a texture the hardware needs to know the resolution, swizzle pattern, format, number of mip levels, whether the texture uses MSAA, and more to be able to load from it. This is all encoded in the descriptor. How this descriptor is represented in binary is up to the hardware vendor and is thus something we generally cannot see directly. Buffers are also accessed via a descriptor, but theirs often only encodes a pointer with a size. If you ever wondered how the hardware is able to return a default value when reading out of bounds, this is how.
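To make this a bit more tangible, here is a minimal HLSL sketch (the resource names and register slots are hypothetical) of what the shader side looks like; the descriptors themselves are created and bound by the driver and the graphics API:

```hlsl
// Resource declarations in HLSL; the driver builds and binds a descriptor for each.
Texture2D<float4>         gAlbedo : register(t0); // descriptor stores resolution, format, mip count, ...
ByteAddressBuffer         gRaw    : register(t1); // descriptor stores (roughly) a base address and a size
RWStructuredBuffer<float> gOut    : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    // A load past the end of gRaw returns zero instead of faulting, because the
    // hardware checks the byte offset against the size stored in the descriptor.
    uint value = gRaw.Load(id.x * 4);
    gOut[id.x] = asfloat(value);
}
```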
Types of buffers
When talking about buffers there are a couple of distinctions we need to make between the various sorts, since each has its own advantages and disadvantages.
1. Byte Address Buffers
The most basic form is the Byte Address Buffer, sometimes called a Raw Buffer. This type allows us to load any data type in the shader by passing a byte offset. However, that is not the complete story. GPUs are generally not able to access data that is not 4-byte aligned, so in reality the byte offset has to be a multiple of 4. Additionally, most hardware is able to load data in chunks of 4, 8, and 16 bytes, which reduces the number of load requests. Some hardware can load these larger chunks from any 4-byte aligned address, but not all of it can. Since Byte Address Buffers do not give any guarantees in terms of alignment, the shader compiler may generate four 4-byte loads instead of a single 16-byte load.
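As a small HLSL sketch (resource names and register slots are made up), this is what those loads look like in a shader; whether the Load4 below becomes one wide load or four narrow ones depends on the alignment the compiler can prove:

```hlsl
ByteAddressBuffer   gSrc : register(t0);
RWByteAddressBuffer gDst : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    // Byte offsets must be a multiple of 4; using a 16-byte stride per thread
    // gives the compiler a chance to emit a single 16-byte load.
    uint offset = id.x * 16;
    uint4 value = gSrc.Load4(offset); // may become one 16-byte load or four 4-byte loads
    gDst.Store4(offset, value);
}
```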
2. Structured Buffers
Structured Buffers are a stricter version of the Byte Address Buffer. The graphics API requires the user to specify the size of the data type (the element stride). Additionally, when not using bindless, a StructuredBuffer in the shader can only load one data type. These restrictions allow the driver and shader compiler to guarantee that data is aligned in a way that allows for 8- and 16-byte load instructions.
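A minimal HLSL sketch (the Particle struct and resource names are hypothetical): because the element stride is declared up front, the compiler knows the alignment of every field and can emit wide loads for the whole element:

```hlsl
struct Particle
{
    float3 position;
    float3 velocity;
    float2 lifetime; // 32 bytes per element in total
};

StructuredBuffer<Particle>   gParticlesIn  : register(t0);
RWStructuredBuffer<Particle> gParticlesOut : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    // The stride (32 bytes) is fixed on the API side, so this element load can be
    // turned into two 16-byte loads instead of eight 4-byte ones.
    Particle p = gParticlesIn[id.x];
    p.position += p.velocity;
    gParticlesOut[id.x] = p;
}
```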
3. Typed Buffers
Typed Buffers are a bit more special, as they use some of the functionality of the texture units. This allows us to load from, for example, an RGBA8_UNORM buffer and let the hardware unpack it to a float4. If we were to use a raw buffer instead, we would need ALU instructions to do this conversion. Of course this is a nice advantage, but using the texture units is not always free either. The general advice is to not use typed buffers unless you really need the extra ALU that would otherwise have been spent on unpacking the data.
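The sketch below (hypothetical resource names; the RGBA8_UNORM format for the typed view is set when the buffer view is created on the API side) shows the same unpack done once by the texture units and once spelled out as ALU instructions on a raw buffer:

```hlsl
Buffer<float4>             gColorsTyped : register(t0); // view created with format RGBA8_UNORM
ByteAddressBuffer          gColorsRaw   : register(t1); // the same bytes, viewed raw
RWStructuredBuffer<float4> gOut         : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    // Typed path: the texture units unpack four UNORM bytes into a float4 for us.
    float4 viaTextureUnit = gColorsTyped[id.x];

    // Raw path: the same conversion written out as shader ALU.
    uint packed = gColorsRaw.Load(id.x * 4);
    float4 viaAlu = float4(packed & 0xff,
                           (packed >> 8) & 0xff,
                           (packed >> 16) & 0xff,
                           packed >> 24) / 255.0;

    gOut[id.x] = viaTextureUnit - viaAlu; // should be (close to) zero
}
```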
Texture Units
While buffer loads are fairly straightforward, texture loads can get quite complicated. What is interesting is that all this extra complexity could be implemented partially or entirely in software, but it is often implemented in hardware for performance. Although this part of the hardware comes in many flavors, we tend to refer to it as the 'texture units'.
Let's go through the work the texture units do one step at a time. Firstly, a texture can be 1D, 2D, or 3D, have mip levels, be an array, be a cubemap, or be a mix of these. For each of these layouts a floating-point coordinate needs to be converted to a texel coordinate. Secondly, there are the address mode sampler settings that decide whether UV coordinates outside the texture need to be wrapped, clamped, mirrored, or replaced with a border color. Finally, the filtering mode also has an impact on the address calculation, as with linear and anisotropic filtering more than one texel needs to be loaded. All this together can result in a lot of logic and texel loads; just imagine doing an anisotropic sample in the corner of a cubemap array texture.
The texture units do more than just address calculation. When they load a texel, they may also need to unpack it to a float32 value. For a format like RGBA8_UNORM this is four relatively simple byte-to-float conversions, but the format may also require an sRGB conversion or use a more complicated block compressed format. To give an example of such a format: in the block compressed formats of PC GPUs, BC1 through BC3, we find two 16-bit colors with a 4x4 matrix of 2-bit elements, where each element indicates how to blend the two 16-bit colors together to create the decompressed color. Finally, the texture units blend all the loaded texels together, and the result is stored in the requested register for use by the shader.
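To give a feel for how much work hides behind a single sample instruction, here is a rough software equivalent of one bilinear 2D sample with clamp addressing, written as an HLSL sketch (hypothetical resource names, fixed texture size, mip selection ignored); in practice the texture unit does all of this for us:

```hlsl
Texture2D<float4>          gTex     : register(t0);
SamplerState               gSampler : register(s0); // linear filtering, clamp addressing
RWStructuredBuffer<float4> gOut     : register(u0);

static const int2 kTexSize = int2(1024, 1024); // assumed resolution for this sketch

float4 BilinearClampSample(float2 uv)
{
    // Address calculation: map UV to texel space and find the top-left texel and blend weights.
    float2 texel = saturate(uv) * float2(kTexSize) - 0.5;
    int2   base  = int2(floor(texel));
    float2 w     = frac(texel);

    // Load the four neighbouring texels, clamped to the texture edges.
    float4 t00 = gTex.Load(int3(clamp(base + int2(0, 0), int2(0, 0), kTexSize - 1), 0));
    float4 t10 = gTex.Load(int3(clamp(base + int2(1, 0), int2(0, 0), kTexSize - 1), 0));
    float4 t01 = gTex.Load(int3(clamp(base + int2(0, 1), int2(0, 0), kTexSize - 1), 0));
    float4 t11 = gTex.Load(int3(clamp(base + int2(1, 1), int2(0, 0), kTexSize - 1), 0));

    // Filtering: blend the four texels together.
    return lerp(lerp(t00, t10, w.x), lerp(t01, t11, w.x), w.y);
}

[numthreads(8, 8, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    float2 uv = (float2(id.xy) + 0.5) / float2(kTexSize);

    float4 hardware = gTex.SampleLevel(gSampler, uv, 0); // one instruction for the shader
    float4 software = BilinearClampSample(uv);           // the same work spelled out manually

    gOut[id.y * kTexSize.x + id.x] = hardware - software;
}
```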
Memory Hierarchy
Due to physical limitations like the speed of light, it is impossible for a large pool of VRAM, say 16 GiB, to be fast enough to keep up with the computational speed of a GPU. This is a limitation in both bandwidth and latency. To alleviate this problem, caches are added in various places on the chip. These vary in size and speed as well as in proximity to where ALU operations happen. The general rule of thumb is: the closer a cache is to where computation is happening, the faster and smaller it is. On most hardware we find at least two levels of cache (L1 and L2) as well as an instruction cache (I-Cache). AMD's RDNA4 hardware even has an L0, L1, L2, Infinity Cache, Instruction Cache, and Scalar Cache. The multiple levels of cache complement each other to create a good balance between size and performance.
Write-through vs write-back vs write-around
Caches do pose an interesting problem when it comes to writing data: do you let the write go through the other caches to main memory, do you wait until the cache line is evicted, or do you skip the cache entirely? These different approaches are called 'write-through', 'write-back', and 'write-around'. On GPUs we mostly find the write-back approach. The advantage of this approach is something called 'write combining': multiple stores can accumulate in the same cache line, so that the whole line is written into the next level of the memory hierarchy at once when it is evicted, instead of issuing a small write for every individual store.
The caches in a GPU are heavily optimized for this kind of spatially local memory access, where bursts of nearby memory can be read or written together. When a shader accesses data in a very sparse pattern, the hardware can try to coalesce these memory requests where possible, but at some point it will have to send out more requests than if the memory accesses were more spatially local.
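The difference between the two access patterns looks roughly like this in HLSL (hypothetical resource names; the 'sparse' indices are assumed to point all over the buffer):

```hlsl
StructuredBuffer<uint>    gIndices : register(t0); // assumed to scatter across the whole buffer
RWStructuredBuffer<float> gData    : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    // Coalesced: neighbouring threads in a wave touch neighbouring elements,
    // so the whole wave can be served by a handful of cache lines.
    float a = gData[id.x];

    // Sparse: every thread jumps to an arbitrary element, so in the worst case
    // each thread needs its own cache line and its own memory request.
    float b = gData[gIndices[id.x]];

    gData[id.x] = a + b;
}
```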
Some algorithms require reading data generated by other threads in the same shader. A simple example of this is a BVH refit where you have a thread per node. Each thread traverses up the tree and would start working on the same data as its neighboring thread, except that we use atomics to kill the thread that got there first, guaranteeing that only one thread works on each node. That thread then needs to load the AABBs of the child nodes. However, one of these child AABBs may have been generated by a thread on the other side of the chip, and may still be stuck in an L1 cache we cannot access. To overcome this, the 'globallycoherent' keyword can be used. This keyword indicates that writes should use some kind of write-through or write-around method down to the first cache that is accessible to all the cores on the GPU. This guarantees that even though the data was written by a core that doesn't share the same L1 cache, we can still read it from L2.
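Below is a simplified sketch of that pattern in HLSL: a single refit step rather than the full upward traversal, with a hypothetical binary-heap node layout (children of node i at 2i + 1 and 2i + 2) and made-up resource names:

```hlsl
struct Aabb
{
    float3 mn;
    float3 mx;
};

// Writes to this UAV must be visible to threads running behind other L1 caches.
globallycoherent RWStructuredBuffer<Aabb> gNodes   : register(u0);
RWStructuredBuffer<uint>                  gVisited : register(u1); // cleared to zero beforehand
StructuredBuffer<uint>                    gParent  : register(t0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    uint node   = id.x;
    uint parent = gParent[node];

    // The first thread to arrive at the parent is killed; the second one continues,
    // which guarantees that both child AABBs have already been written.
    uint previous;
    InterlockedAdd(gVisited[parent], 1, previous);
    if (previous == 0)
        return;

    // The sibling's AABB may have been written by a core behind a different L1 cache;
    // because gNodes is globallycoherent we still read the up-to-date value.
    Aabb left  = gNodes[2 * parent + 1];
    Aabb right = gNodes[2 * parent + 2];

    Aabb merged;
    merged.mn = min(left.mn, right.mn);
    merged.mx = max(left.mx, right.mx);
    gNodes[parent] = merged;
}
```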
Hiding Latency
Even with caches, there will still be situations where data is not in a cache and has to be fetched from main memory. To deal with this, the GPU can hold more threads in flight than it can execute at once. Instead of letting a compute unit idle, it simply switches to a different wave of threads while it waits for the memory request to finish. If a shader does not use too many registers or too much groupshared memory, a compute unit can keep many waves in flight, increasing the chance that it never has to idle.
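As an illustration of the trade-off (a hypothetical HLSL sketch; exact limits differ per vendor and architecture), a shader that asks for a lot of groupshared memory leaves room for only a few thread groups per compute unit, and therefore few waves to switch to:

```hlsl
// 2048 * 16 bytes = 32 KiB of groupshared memory per thread group. On most hardware
// this means only one or two groups fit per compute unit, leaving the scheduler with
// few other waves to switch to while memory requests are outstanding.
groupshared float4 gTile[2048];

StructuredBuffer<float4>   gInput  : register(t0);
RWStructuredBuffer<float4> gOutput : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 id : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // Each thread stages 32 elements in groupshared memory...
    for (uint i = 0; i < 32; ++i)
        gTile[gi * 32 + i] = gInput[id.x * 32 + i];

    GroupMemoryBarrierWithGroupSync();

    // ...and reads back the first element staged by its neighbouring thread.
    gOutput[id.x] = gTile[((gi + 1) % 64) * 32];
}
```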
It is important to note, though, that having more waves of threads in flight on the GPU is not always better for performance. In very ALU- or memory-bandwidth-intensive use cases it can cause cache thrashing, because the relatively small caches cannot absorb all the memory accesses from all the waves. To illustrate this, imagine a cache line getting loaded in, and by the time it is accessed again it has already been evicted because of the many accesses to other cache lines. If fewer waves run at a time, the cache line may not have been evicted by the memory accesses of another wave.