Posted on July 29, 2025
Futhark is a programming language meant for writing fast programs, but as is the case for every programming language meant for writing fast programs, it inevitably happens that a programmer will use it to write a program that is not fast. When this happens, the programmer will likely want to know why their program is not fast, and how to make it faster. A useful tool for answering such questions is a profiler - a tool that tells you how long the different parts of your program take to run. This post is about how profiling in Futhark became slightly more useful with the most recent release.
Initially, Futhark had no real profiling support, except for some semi-documented support for dumping a report of GPU operations. Eventually we added futhark profile, which allows the machine-readable profiling data produced by futhark bench to be turned into human-readable reports. Specifically, the Futhark runtime system will tally up the time spent in various cost centres, which for the GPU backends are GPU kernels and other operations such as copies, and put it in a table. However, the information you get out still looks like this:
Cost centre                           count          sum        avg        min        max  fraction
---------------------------------------------------------------------------------------------------
builtin#replicate_i8.replicate_25280     63     346.11μs     5.49μs     5.12μs     7.17μs    0.0019
copy_dev_to_dev                           6      80.90μs    13.48μs     8.19μs    27.65μs    0.0004
copy_lmad_dev_to_dev                      1       8.19μs     8.19μs     8.19μs     8.19μs    0.0000
initOperator_6347.gpuseq_25229            2      20.48μs    10.24μs     9.22μs    11.26μs    0.0001
initOperator_6347.replicate_25196         2      22.53μs    11.26μs    11.26μs    11.26μs    0.0001
initOperator_6347.segmap_19237            2      31.74μs    15.87μs    14.34μs    17.41μs    0.0002
main.gpuseq_26401                        63     382.98μs     6.08μs     5.12μs     7.17μs    0.0021
main.gpuseq_26416                        63     344.06μs     5.46μs     5.12μs     7.17μs    0.0019
main.replicate_25225                      1     184.32μs   184.32μs   184.32μs   184.32μs    0.0010
main.segmap_19296                         1      23.55μs    23.55μs    23.55μs    23.55μs    0.0001
main.segmap_19320                         1      13.31μs    13.31μs    13.31μs    13.31μs    0.0001
main.segmap_23375                         1      13.31μs    13.31μs    13.31μs    13.31μs    0.0001
main.segmap_23407                        63     408.58μs     6.49μs     6.14μs     7.17μs    0.0022
main.segmap_23494                        63     421.89μs     6.70μs     6.14μs     8.19μs    0.0023
main.segmap_23553                        63    9211.90μs   146.22μs   144.38μs   147.46μs    0.0501
main.segmap_intrablock_23610             63  100120.57μs  1589.22μs  1573.89μs  1616.90μs    0.5445
main.segmap_intrablock_24377             63   60806.14μs   965.18μs   955.39μs   985.09μs    0.3307
main.segscan_23448                       63     785.41μs    12.47μs    11.26μs    13.31μs    0.0043
map_transpose_4b                        127   10655.74μs    83.90μs    80.90μs    89.09μs    0.0579
---------------------------------------------------------------------------------------------------
Now a user may reasonably object: “Hold on! I don’t remember my program containing anything called main.segmap_23494!” And indeed, these cost centres refer to compiler-generated names. You can squint to get some meaning out of them: segscan is certainly some kind of scan operation, and segmap is a map. But due to inlining, it can be difficult to guess which functions result in which GPU operations, and optimisations may obscure the relation between source code and generated code - indeed, those segmap_intrablock operations are actually mainly (nested) scans that are then turned into block-level scans via incremental flattening. Clearly, it is still not easy to use this information. The profiler will usually just tell the programmer that their program spends all its time executing code with a name the programmer cannot possibly recognise. What is missing is a way to relate generated code with the original source code. I decided to call such information provenance, in the sense of “the ultimate origin of something”. The problem is then to attach provenance to every bit of generated code, and in particular, to the generated GPU kernels.
Unfortunately, the Futhark compiler was never really designed to track provenance. The frontend always annotated all syntactic constructs with precise source information in order to do good error reporting, but that information was thrown away during desugaring and lowering into the intermediate language (IR). To improve the usability of profiling, we had to actually propagate source information throughout the entire compiler - and I was quite uncertain about how practical it would be to make such a change to a ten-year-old code base comprising over a hundred thousand source lines of Haskell. The main challenge is that the Futhark compiler does a lot of optimisations where it rewrites part of the program. Manually propagating provenance through every such rewrite is simply not viable or maintainable. We had to come up with some kind of semi-automated general scheme.
Further, there were some conceptual questions about how to track provenance across the compiler, due to the aggressive transformations it does. For example, the user may write something like this:
let b = map f a
let c = map g b
let d = map h c
The compiler may then perform fusion to essentially turn the program into this:
let d = map (\x -> h (g (f x))) a
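To make the fusion example concrete, here is a minimal sketch of both forms as a complete Futhark program. The bodies of f, g and h, the i32 element type, and the names unfused, fused and agree are all assumptions made for illustration; the post leaves the functions abstract, and the compiler performs fusion on its IR rather than on source text like this.

```futhark
-- Hypothetical element-wise functions; the post does not define f, g or h.
def f (x: i32): i32 = x + 1
def g (x: i32): i32 = x * 2
def h (x: i32): i32 = x - 3

-- Three separate maps, as the user wrote them.
def unfused [n] (a: [n]i32): [n]i32 =
  let b = map f a
  let c = map g b
  let d = map h c
  in d

-- A single map over the composed function, roughly what fusion produces.
def fused [n] (a: [n]i32): [n]i32 =
  map (\x -> h (g (f x))) a

-- Check that both versions agree element-wise on a given input.
-- ==
-- entry: agree
-- input { [1, 2, 3] } output { true }
entry agree [n] (a: [n]i32): bool =
  and (map2 (==) (unfused a) (fused a))
```

Running futhark test on a file containing this program should exercise the agree entry point with the input given in the comment block. The point for provenance is that the single map in fused originates from three distinct source lines.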