Learning About GPUs Through Measuring Memory Bandwidth
At Traverse Research, we need a deep understanding of GPU performance to develop our benchmark, Evolve. We also sometimes do projects for very specific hardware, where we need to know all the ins and outs of that hardware. One way we get there is by using microbenchmarks that measure specific parts of the GPU. In this article, we share what we learned from measuring the memory bandwidth of various GPUs. First we go over some background information about how GPU hardware loads from and stores to memory, then we look at how our microbenchmark is built, and finally we look at the GPUs whose bandwidth we measured and what we learned from them.
Background Information
Accessing memory on a GPU is quite a bit more complicated than on a CPU. In this section we will talk about different concepts that are handy to keep in mind when programming for a GPU.
Descriptors
Memory on a GPU is usually not accessed directly via a pointer like on a CPU. Although some hardware is capable of doing this, buffer and texture access usually happens via a descriptor. A descriptor is nothing more than a pointer with extra metadata to support more complex logic when fetching data. For a texture, for example, the hardware needs to know the resolution, swizzle pattern, format, number of mip levels, whether the texture uses MSAA, and more to be able to load from it. This is all encoded in the descriptor. How the descriptor is represented in binary is up to the hardware vendor and is thus something we generally cannot see directly. Buffers are also accessed via a descriptor, but a buffer descriptor often only encodes a pointer and a size. If you ever wondered how the hardware is able to return a default value when reading out of bounds, this is how: the size stored in the descriptor lets it check every access.
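To make the bounds check concrete, here is a minimal CPU-side sketch of a buffer descriptor. It is only an illustration: real descriptor layouts are vendor-specific, and the names `MEMORY`, `BufferDescriptor` and `load_u32`, as well as the choice of zero as the default value, are assumptions made for this example.

```python
import struct
from dataclasses import dataclass

# Hypothetical stand-in for GPU memory: one flat byte array.
MEMORY = bytearray(64)


@dataclass(frozen=True)
class BufferDescriptor:
    """Illustrative only: real descriptor layouts are vendor-specific."""
    base: int  # start address of the buffer inside MEMORY
    size: int  # buffer size in bytes


def load_u32(desc: BufferDescriptor, byte_offset: int) -> int:
    """Load a 32-bit value, returning 0 (assumed default) when out of bounds."""
    if byte_offset < 0 or byte_offset + 4 > desc.size:
        return 0  # the access is rejected before memory is touched
    return struct.unpack_from("<I", MEMORY, desc.base + byte_offset)[0]


# Place a 16-byte buffer at address 16 holding the values 1, 2, 3, 4.
struct.pack_into("<4I", MEMORY, 16, 1, 2, 3, 4)
desc = BufferDescriptor(base=16, size=16)

in_bounds = load_u32(desc, 8)       # third element of the buffer
out_of_bounds = load_u32(desc, 16)  # one element past the end
```

The out-of-bounds load never reads the bytes behind the buffer, because the size check happens against the descriptor rather than against the underlying memory.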
Types of buffers
When talking about buffers, there are a couple of distinctions we need to make between the various sorts, since each has its own advantages and disadvantages.
1. Byte Address Buffers
The most basic form is the Byte Address Buffer, sometimes called the Raw Buffer. This type allows us to load any data type in the shader by passing a byte offset. However, that is not the complete story. GPUs are generally not able to access data that is not 4-byte aligned, so in reality the byte offset has to be a multiple of 4. Additionally, most hardware is able to load data in chunks of 4, 8 and 16 bytes, which reduces the number of load requests. Some hardware is able to load these larger chunks from any 4-byte aligned address, but not all hardware is able to do so. Since Byte Address Buffers give no alignment guarantees beyond those 4 bytes, the shader compiler may generate four 4-byte loads instead of a single 16-byte load.
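The effect of alignment guarantees on the number of load requests can be sketched with a small model. This is not how any particular shader compiler works: `plan_loads` is a hypothetical helper, and it assumes that hardware which cannot load wide chunks from any 4-byte aligned address instead requires each load to be aligned to its own width.

```python
def plan_loads(size, guaranteed_alignment, needs_natural_alignment):
    """Return the widths in bytes of the loads used to fetch `size` bytes.

    guaranteed_alignment: alignment of the start offset that is known
        at compile time (4 for a Byte Address Buffer).
    needs_natural_alignment: assumed hardware restriction that a load of
        width N bytes must start at a multiple of N.
    """
    if size % 4 != 0 or guaranteed_alignment % 4 != 0:
        raise ValueError("sizes and alignments must be multiples of 4 bytes")
    widths = []
    remaining = size
    while remaining > 0:
        # Try the widest chunk first; a 4-byte load is always allowed.
        for width in (16, 8, 4):
            fits = width <= remaining
            aligned = (not needs_natural_alignment
                       or guaranteed_alignment % width == 0)
            if fits and aligned:
                widths.append(width)
                remaining -= width
                break
    return widths


# 16 bytes from a Byte Address Buffer, where only 4-byte alignment is known.
strict_hardware = plan_loads(16, 4, needs_natural_alignment=True)
flexible_hardware = plan_loads(16, 4, needs_natural_alignment=False)

# The same 16 bytes when a 16-byte alignment is guaranteed.
aligned_buffer = plan_loads(16, 16, needs_natural_alignment=True)
```

Under these assumptions the strict hardware needs four requests for the Byte Address Buffer case, while either a more flexible memory unit or a stronger alignment guarantee brings that down to one.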