AMD GPU Debugger

I’ve always wondered why we don’t have a GPU debugger similar to the ones we use for CPUs: a tool that lets you pause execution and examine the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, it shows that this is technically possible. I then found a helpful series of blog posts by Marcell Kiss detailing how he achieved this, which inspired me to try to recreate the process myself.

Let’s Try To Talk To The GPU Directly

The best place to start learning about this is RADV. By tracing what it does, we can work out how to do the same thing ourselves. Our goal here is to run the most basic shader, nop 0, without going through Vulkan, which in our case means without RADV.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Tracing further, we find that RADV calls amdgpu_device_initialize, a function defined in libdrm, the library that acts as middleware between user mode drivers (UMD) like RADV and kernel mode drivers (KMD) like the amdgpu driver. Before we can do any actual work, we have to create a context, which is done by calling amdgpu_cs_ctx_create, again from libdrm. Next, we need to allocate two buffers: one for our code and one for writing our commands into. This takes a couple of function calls; here’s how I do it:

void bo_alloc(amdgpu_t* dev, size_t size, u32 domain, bool uncached, amdgpubo_t* bo) {
    s32 ret = -1;
    u32 alignment = 0;
    u32 flags = 0;
    size_t actual_size = 0;
    amdgpu_bo_handle bo_handle = NULL;
    amdgpu_va_handle va_handle = NULL;
    u64 va_addr = 0;
    void* host_addr = NULL;

Here we pick the size, alignment, and flags based on the domain and the other parameters. Some buffers will need to be uncached, as we will see:

    if (domain != AMDGPU_GEM_DOMAIN_GWS && domain != AMDGPU_GEM_DOMAIN_GDS &&
        domain != AMDGPU_GEM_DOMAIN_OA) {
        // Regular memory domains: round the size up to a whole 4 KiB page.
        actual_size = (size + 4096 - 1) & 0xFFFFFFFFFFFFF000ULL;
        alignment = 4096;
        flags = AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED | AMDGPU_GEM_CREATE_VRAM_CLEARED |
                AMDGPU_GEM_CREATE_VM_ALWAYS_VALID;
        // The USWC flag is only added for uncached buffers in the GTT domain.
        flags |= uncached ? (domain == AMDGPU_GEM_DOMAIN_GTT) * AMDGPU_GEM_CREATE_CPU_GTT_USWC : 0;
    } else {
        // GWS, GDS, and OA: no page rounding and no CPU access.
        actual_size = size;
        alignment = 1;
        flags = AMDGPU_GEM_CREATE_NO_CPU_ACCESS;
    }

    struct amdgpu_bo_alloc_request req = {
        .alloc_size = actual_size,
        .phys_alignment = alignment,
        .preferred_heap = domain,
        .flags = flags,
    };

    // memory acquired!!
    ret = amdgpu_bo_alloc(dev->dev_handle, &req, &bo_handle);
    HDB_ASSERT(!ret, "can't allocate bo");
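The size rounding in the allocation code is easy to check in isolation. Below is a minimal sketch, not taken from the original project: round_up_source_style and round_up_pow2 are hypothetical helper names introduced only for illustration. The first repeats the expression from the code above (add 4095, then clear the low 12 bits), and the second is the usual generalization to any power-of-two alignment, which should give the same result for 4096.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the allocation code: add (4096 - 1), then mask off
 * the low 12 bits so the result is a multiple of 4096.
 * (Hypothetical helper name, for illustration only.) */
static inline uint64_t round_up_source_style(uint64_t size)
{
    return (size + 4096 - 1) & 0xFFFFFFFFFFFFF000ULL;
}

/* Generic form of the same trick for any power-of-two alignment.
 * (Hypothetical helper name, for illustration only.) */
static inline uint64_t round_up_pow2(uint64_t size, uint64_t alignment)
{
    /* A power of two has exactly one bit set. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

For example, a 1-byte request becomes one full 4096-byte page, a request of exactly 4096 bytes stays at 4096, and 4097 bytes rounds up to 8192. With an alignment of 1, as in the GWS/GDS/OA branch, the generic form returns the size unchanged.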

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped, for ease of use. The memory has to be mapped into both the GPU and the CPU virtual address spaces. The KMD creates the page table when we open the DRM file, as shown here.

So we map it to the GPU VM and, if possible, to the CPU VM as well. libdrm has a function, amdgpu_bo_va_op, that does all of this setup and maps the memory for us, but I found that even when specifying AMDGPU_VM_MTYPE_UC, it doesn’t always tag the page as uncached. I’m not sure whether that’s a bug in my code or something in libdrm. Either way, I opted to do it manually here and issue the IOCTL call myself:
