Tooling
How can we debug code that doesn’t bind buffer and texture objects and doesn’t call an API to describe the memory layout explicitly? C/C++ debuggers have been doing exactly that for decades. There are no special operating system APIs for describing your software’s memory layout. The debugger follows 64-bit pointer chains and uses the debug symbol data provided by your compiler, which includes the memory layouts of your structs and classes. CUDA and Metal use C/C++ based shading languages with full 64-bit pointer semantics, and both have robust debuggers that traverse pointer chains without issues. The texture descriptor heap is just GPU memory: the debugger can index it, load a texture descriptor, show the descriptor data and visualize the texels. All of this already works in the Xcode Metal debugger. Click on a texture or sampler handle in any struct at any GPU address and the debugger will visualize it.
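For illustration, here is a hypothetical pair of GPU-visible structs (names and fields are invented): given debug symbols, a GPU debugger can follow the 64-bit material pointer exactly like a CPU debugger walks C++ objects, and visualize the texture behind the 32-bit heap index.

```cpp
#include <cstdint>

// Hypothetical GPU-visible structs, for illustration only.
struct Material {
    uint32_t albedoTexture;    // index into the texture descriptor heap
    uint32_t albedoSampler;    // sampler handle
    float    roughness;
};

struct Instance {
    Material* material;        // 64-bit GPU pointer; the debugger follows it
    float     objectToWorld[12];
};
```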
Modern GPUs virtualize memory: each process has its own page table. A GPU capture is replayed by a separate replayer process with its own virtual address space. If the replayer naively replayed all the allocations, each memory allocation would land at a different GPU virtual address. This was fine for legacy APIs, as it wasn’t possible to directly store GPU addresses in your data structures. A modern API needs special replay memory allocation APIs that force the replayer to mirror the exact GPU virtual memory layout of the captured process. DX12 and Vulkan BDA have public APIs for this: RecreateAt and VkMemoryOpaqueCaptureAddressAllocateInfo. The Metal and CUDA debuggers do the same using internal undocumented APIs. A public API is preferable, as it allows open source tools like RenderDoc to function.
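As a rough sketch of the public Vulkan path (buffer-side capture replay flags and error handling omitted): at capture time the tool records each allocation’s opaque capture address, and at replay time it passes that address back so the allocation lands at the same GPU virtual address.

```cpp
#include <vulkan/vulkan.h>

// Capture: record the opaque capture address of an allocation.
uint64_t captureAddress(VkDevice device, VkDeviceMemory memory) {
    VkDeviceMemoryOpaqueCaptureAddressInfo info{};
    info.sType = VK_STRUCTURE_TYPE_DEVICE_MEMORY_OPAQUE_CAPTURE_ADDRESS_INFO;
    info.memory = memory;
    return vkGetDeviceMemoryOpaqueCaptureAddress(device, &info);
}

// Replay: allocate memory that reuses the recorded capture address.
VkDeviceMemory replayAllocation(VkDevice device, VkDeviceSize size,
                                uint32_t memoryTypeIndex, uint64_t capturedAddress) {
    VkMemoryOpaqueCaptureAddressAllocateInfo replayInfo{};
    replayInfo.sType = VK_STRUCTURE_TYPE_MEMORY_OPAQUE_CAPTURE_ADDRESS_ALLOCATE_INFO;
    replayInfo.opaqueCaptureAddress = capturedAddress;

    VkMemoryAllocateFlagsInfo flagsInfo{};
    flagsInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
    flagsInfo.pNext = &replayInfo;
    flagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT |
                      VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_CAPTURE_REPLAY_BIT;

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext = &flagsInfo;
    allocInfo.allocationSize = size;
    allocInfo.memoryTypeIndex = memoryTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &allocInfo, nullptr, &memory);
    return memory;
}
```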
Don’t raw pointers bring security concerns? Can’t you just read/write other apps' memory? This is not possible due to virtual memory: you can only access your own memory pages. If you accidentally use a stale pointer or overflow a buffer, you get a page fault. Page faults are already possible with the existing buffer based APIs. DirectX 12 and Vulkan don’t clamp your storage (byteaddress/structured) buffer addresses, so an out-of-bounds access causes a page fault. Users can also accidentally free a memory heap and continue using stale buffer or texture descriptors, which again produces a page fault. Nothing really changes: an access to an unmapped region is a page fault and the application crashes. This is familiar to C/C++ programmers. If you want robustness, you can use ptr + size pairs. That’s exactly how WebGPU is implemented. The WebGPU shader compiler (Tint or Naga) emits an extra clamp instruction for each buffer access, including vertex accesses (index buffer value out of bounds). WebGL didn’t allow sharing index buffer data with other data, and it scanned through the indices on the CPU side (making index buffer updates very slow). Back then custom vertex fetch was not possible, so an out-of-bounds index would have made the hardware page fault before the shader even ran.
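A minimal sketch of the ptr + size idea (not WebGPU’s actual implementation, just the kind of guarded access a compiler like Tint or Naga lowers to): every load is clamped so it can never leave the buffer range.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical buffer view: a raw pointer paired with a size.
struct BufferView {
    const uint8_t* ptr;   // base address
    uint32_t       size;  // size in bytes
};

// Robust load: the offset is clamped to the last valid element,
// so the access can never fault outside the buffer.
template <typename T>
T loadClamped(const BufferView& buf, uint32_t offset) {
    uint32_t maxOffset = buf.size >= sizeof(T) ? uint32_t(buf.size - sizeof(T)) : 0u;
    uint32_t safeOffset = offset > maxOffset ? maxOffset : offset;
    T value{};
    std::memcpy(&value, buf.ptr + safeOffset, sizeof(T));
    return value;
}
```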
Translation layers
Being able to run existing software is essential. Translation layers such as ANGLE, Proton and MoltenVK play a crucial role in the portability and deprecation process of legacy APIs. Let’s talk about translating DirectX 12, Vulkan and Metal to our new API.
MoltenVK (Vulkan to Metal translation layer) proves that Vulkan’s buffer centric API can be translated to Metal’s 64-bit pointer based ecosystem. MoltenVK translates Vulkan’s descriptor sets into Metal’s argument buffers. The generated argument buffers are standard GPU structs containing a 64-bit GPU pointer per buffer binding and a 64-bit texture ID per texture binding. We can do better: by allocating a contiguous range of texture descriptors in our texture heap for each descriptor set, we can store a single 32-bit base index instead of a 64-bit texture ID per texture binding. This is possible because, unlike Metal, our API has a user managed texture heap.
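As a hypothetical illustration of the two layouts (field names invented): a MoltenVK-style argument buffer stores one 64-bit handle per binding, while the contiguous heap allocation lets a single 32-bit base index cover all texture bindings in the set.

```cpp
#include <cstdint>

// MoltenVK-style argument buffer: one 64-bit entry per binding.
struct DescriptorSetArgumentBuffer {
    uint64_t uniformBuffer;    // buffer binding -> 64-bit GPU pointer
    uint64_t albedoTexture;    // texture binding -> 64-bit texture ID
    uint64_t normalTexture;    // texture binding -> 64-bit texture ID
};

// Our API: the set's textures are allocated contiguously in the texture heap,
// so one 32-bit base index replaces the per-binding 64-bit texture IDs.
struct DescriptorSetCompact {
    uint64_t uniformBuffer;    // buffer binding -> 64-bit GPU pointer
    uint32_t textureBase;      // albedo = textureBase + 0, normal = textureBase + 1
};
```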
MoltenVK maps descriptor sets to Metal API root bind slots. We generate a root struct with up to eight 64-bit pointer fields, each pointing to a descriptor set struct (see above). Root constants are translated into value fields and root descriptors (root buffers) are translated into 64-bit pointers. The efficiency should be identical, assuming the GPU driver preloads our root struct fields into uniform/scalar registers (as discussed in the root arguments chapter).
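A hypothetical root struct for a pipeline with root constants, one root buffer and two descriptor sets might look like this (layout invented for illustration):

```cpp
#include <cstdint>

// Hypothetical translated root struct, ideally preloaded by the driver
// into uniform/scalar registers.
struct RootArguments {
    uint32_t rootConstants[4];   // root constants -> value fields
    uint64_t rootBuffer;         // root descriptor -> 64-bit pointer
    uint64_t descriptorSet0;     // pointer to a descriptor set struct (see above)
    uint64_t descriptorSet1;
};
```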
Our API uses 64-bit pointer semantics like Metal. We can use the same techniques employed by MoltenVK to translate the buffer load/store instructions in the shader. MoltenVK also supports translating Vulkan’s new buffer device address extension.
Proton (whose vkd3d-proton layer translates DX12 to Vulkan) proves that the DirectX 12 SM 6.6 descriptor heap can be translated to Vulkan’s new descriptor buffer extension, along with the rest of the DirectX 12 feature set. We have already shown that Vulkan to Metal translation is possible with MoltenVK, which transitively shows that DirectX 12 to Metal translation should be possible too. The biggest missing feature in MoltenVK is the SM 6.6 style descriptor heap (Vulkan’s descriptor buffer extension), because Metal doesn’t expose the descriptor heap directly to the user. Our proposed API has no such limitation: our descriptor heap semantics are a superset of the SM 6.6 descriptor heap and a close match to Vulkan’s descriptor buffer extension, so translation is straightforward. Vulkan’s extension also adds a special flag for descriptor invalidation, matching our HAZARD_DESCRIPTORS. The DirectX 12 descriptor heap API is easy to translate, as it’s just a thin wrapper over a raw descriptor array in GPU memory.
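Because the DX12 descriptor heap is a thin wrapper over a raw descriptor array, a translation layer can implement descriptor writes as plain memory copies into the heap it manages. The sketch below is illustrative; the type and function names are invented.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical translated descriptor heap: a raw, GPU-visible descriptor array.
struct TranslatedDescriptorHeap {
    uint8_t* cpuMapped;        // CPU-visible mapping of the descriptor array
    uint32_t descriptorSize;   // per-descriptor stride reported by the driver
};

// A DX12-style "create view at heap slot N" call becomes a memcpy at offset
// N * descriptorSize into the raw descriptor array.
void writeDescriptor(TranslatedDescriptorHeap& heap, uint32_t slot,
                     const void* descriptorData) {
    std::memcpy(heap.cpuMapped + size_t(slot) * heap.descriptorSize,
                descriptorData, heap.descriptorSize);
}
```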