Gluon extends the Triton language and compiler with an additional approach to GPU kernel programming. It strikes a different balance on the portability-versus-performance spectrum by exposing more compiler internals, giving developers more explicit control and a higher performance ceiling. In this blog post I’ll explain Gluon as I understand it. I’ll also use this as an opportunity to talk about domain-specific languages, particularly in the context of rapidly evolving agentic software development.
Gluon and Triton
Let’s start by introducing Gluon. As in previous blog posts in this series, I’ll discuss the overall structure and design choices and explain my mental model, rather than surveying language features or providing a getting-started guide for implementing specific kernels. For the latter, there already exist great talks and NVIDIA GPU-specific tutorials; we are also developing similar ones for AMD GPUs.
Frontend to Triton GPU IR
Inside the Triton compiler, we have three levels of IRs: Triton (tt), Triton GPU (ttg), and LLVM (llvm). Triton can be seen as the Python frontend to the tt IR: we mechanically parse the Python AST using the visitor pattern and emit the corresponding tt ops. There are a few layers in this procedure:
@triton.jit (decorator)        # python/triton/runtime/jit.py
→ JITFunction._do_compile()    # python/triton/runtime/jit.py
→ ASTSource.make_ir()          # python/triton/compiler/compiler.py
→ ast_to_ttir()                # python/triton/compiler/code_generator.py
→ TritonSemantic               # python/triton/language/semantic.py
→ ir.builder                   # python/src/ir.cc
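As a toy illustration of the visitor-pattern emission (not the actual Triton `CodeGenerator` implementation; all names below are made up), Python's own `ast` module is enough to sketch how visiting AST nodes translates into emitting ops:

```python
import ast

# Toy sketch only: a visitor that "emits" ops as strings while walking
# the Python AST, loosely mimicking how Triton's CodeGenerator emits
# tt ops through the IR builder. Not the real Triton API.
class ToyCodeGenerator(ast.NodeVisitor):
    def __init__(self):
        self.ops = []  # emitted "IR" ops, in order

    def visit_Name(self, node):
        # Only reads of a variable become loads; assignment targets are skipped.
        if isinstance(node.ctx, ast.Load):
            self.ops.append(f"tt.load %{node.id}")

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit operands first, then emit the op
        if isinstance(node.op, ast.Add):
            self.ops.append("tt.add")
        elif isinstance(node.op, ast.Mult):
            self.ops.append("tt.mul")

def emit(src):
    gen = ToyCodeGenerator()
    gen.visit(ast.parse(src))
    return gen.ops

print(emit("c = a + b"))  # ['tt.load %a', 'tt.load %b', 'tt.add']
```

The real `CodeGenerator` works the same way in spirit: each `visit_*` method delegates to a semantic object, which in turn calls the IR builder to construct actual MLIR ops.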
Similarly, Gluon is effectively the Python frontend to the ttg IR. It defines its own @gluon.jit decorator, GluonJITFunction (a subclass of JITFunction), and GluonASTSource (a subclass of ASTSource) in its _runtime.py. Overall it shares much of the same mechanical AST parsing and IR generation flow, in particular the core CodeGenerator class; the only difference is the IR builder and semantic instance plugged into it.
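This design, one shared code generator parameterized by a pluggable semantic/builder pair, can be sketched as follows (a highly simplified illustration with hypothetical class bodies; only the class names mirror the real ones):

```python
# Simplified sketch of one shared CodeGenerator with swappable
# semantic objects. The method bodies and op strings are illustrative,
# not the real Triton/Gluon implementations.
class TritonSemantic:
    """Emits tt-level ops; layouts are decided later by the compiler."""
    def add(self, a, b):
        return f"tt.addf {a}, {b}"

class GluonSemantic:
    """Emits ttg-level ops directly, with layouts made explicit."""
    def add(self, a, b):
        return f"ttg.addf {a}, {b} : tensor<128xf32, #blocked>"

class CodeGenerator:
    """Shared AST-walking core; only the plugged-in semantic differs."""
    def __init__(self, semantic):
        self.semantic = semantic

    def lower_add(self, a, b):
        # The AST visitor delegates to the semantic, which (in the real
        # compiler) calls ir.builder or GluonOpBuilder underneath.
        return self.semantic.add(a, b)

triton_gen = CodeGenerator(TritonSemantic())
gluon_gen = CodeGenerator(GluonSemantic())
print(triton_gen.lower_add("%0", "%1"))  # tt-level op
print(gluon_gen.lower_add("%0", "%1"))   # ttg-level op, layout included
```

The point of the sketch: the same traversal machinery serves both frontends, and the choice of semantic/builder determines which IR level the Python source lowers into.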
The following is an overall JIT compilation flow comparison between Triton and Gluon:
Triton:
@triton.jit → ASTSource → ast_to_ttir() → CodeGenerator + TritonSemantic + ir.builder
→ ttir → ttgir → llir → ptx/amdgcn → cubin/hsaco

Gluon:
@gluon.jit → GluonASTSource → ast_to_ttir() → CodeGenerator + GluonSemantic + GluonOpBuilder
→ ttgir → llir → ptx/amdgcn → cubin/hsaco
As you can see from the above, the major difference is skipping the tt IR and directly exposing and building ttg IR from Python. This naturally means that developers gain access to low-level explicit controls that were previously hidden inside the compiler. It also means that developers must now handle the optimizations previously performed by the tt-to-ttg conversion.
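The difference can be modeled as a toy sketch: both frontends feed the same tail of the pipeline, Gluon simply enters one stage later. (`stages_from` is an illustrative helper, not a real API; the stage names follow the flow shown earlier.)

```python
# Toy model of the shared JIT pipeline as an ordered stage list.
# The real stages are MLIR pass pipelines, not strings.
STAGES = ["ttir", "ttgir", "llir", "ptx/amdgcn", "cubin/hsaco"]

def stages_from(entry):
    """Return the stages a kernel flows through, given its entry IR."""
    return STAGES[STAGES.index(entry):]

print(stages_from("ttir"))   # Triton's frontend emits tt IR
print(stages_from("ttgir"))  # Gluon's frontend emits ttg IR directly
```

Everything downstream of ttgir (lowering to llir, then to ptx/amdgcn and cubin/hsaco) is shared, which is why Gluon kernels still benefit from the existing backend infrastructure.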