
Triton Bespoke Layouts


Hopefully the previous articles covering linear layout concepts and examples have helped build a solid understanding of the core generic layer powering various Triton code generation lowerings and optimizations. Now let's turn our focus to the bespoke layouts, which we still interact with constantly when working on Triton compiler internals. Additionally, developers can now program layouts directly with Gluon; writing bespoke layouts is generally more intuitive than writing linear layouts.

By bespoke layouts, I mean traditional layouts like blocked/shared/MMA layouts. In certain places they are also called legacy layouts. Given that we still actively use them and there are no plans to deprecate them, I personally prefer calling them bespoke layouts, to emphasize that each of them is tailored to a specific need.

Bespoke vs Linear Layouts

One might ask why we need two sets of layouts and what different purposes they serve, if any.

Chronologically, we only had bespoke layouts at the beginning. They model key hardware tensor ownership patterns in a straightforward manner; they are easy to understand and get the job done for common cases. However, as kernels become more and more complicated and invite more and more optimizations, their shortcomings become obvious: because each bespoke layout uses its own IR definition and underlying mechanism, we need more and more point-to-point conversion cases.

Take the general ttg.convert_layout operation we mentioned earlier as an example: it can have different source and destination layouts. Without a generic mechanism, we need to consider each combination separately and use different code paths, which means solving a combinatorial problem in the space of source layout (blocked, shared, MMA, etc.) × destination layout (blocked, shared, MMA, etc.) × data exchange (intra-thread, intra-warp, inter-warp, etc.).
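To get a feel for the scale of that combinatorial space, here is a minimal Python sketch; the layout and exchange names are simplified placeholders, not an exhaustive list of Triton's actual layouts and parameters:

```python
from itertools import product

# Simplified placeholders; real Triton has more layouts, each parameterized.
src_layouts = ["blocked", "shared", "mma"]
dst_layouts = ["blocked", "shared", "mma"]
exchanges = ["intra-thread", "intra-warp", "inter-warp"]

# Without a generic mechanism, every tuple is potentially its own code path.
cases = list(product(src_layouts, dst_layouts, exchanges))
print(len(cases))  # 3 * 3 * 3 = 27 point-to-point cases, and growing
```

Each new layout kind multiplies the space again, which is exactly why point-to-point handling does not scale.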

ttg.convert_layout serves as an internal bridge inside the compiler for potential data ownership exchanges: we insert it whenever the type system shows a layout mismatch. This approach gives us localized compiler transformations that are easier to manage. On the flip side, it does mean there can be lots of redundant conversions, which we want to optimize away if possible.

Further, from the kernel's perspective, we write at the block level and process n-D tensors. It's quite common to perform .permute(), .reshape(), and other shape manipulation operations. Conceptually these operations cost nothing: they just create derivative "views" of the original tensor without actually shuffling data in hardware. So when optimizing layout conversions, we would like to optimize through them if possible.
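The zero-cost nature of such view operations is easy to demonstrate with NumPy on the host side; NumPy here is only an analogy for block-level tensor views, not Triton's implementation:

```python
import numpy as np

a = np.arange(12).reshape(3, 4).copy()  # .copy() so `a` owns its buffer

b = a.transpose()    # a permuted view: no data is moved
c = a.reshape(4, 3)  # a reshaped view of the same contiguous buffer

# Both derive from `a` without copying.
assert b.base is a and c.base is a

a[0, 0] = 99         # writing through `a` is visible in every view
print(b[0, 0], c[0, 0])  # 99 99
```

The compiler wants the same property: shape manipulations should only rewrite the layout metadata, not force a data exchange.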

For the compiler to realize the above, it needs to reason about and compute how element ownership transfers throughout the kernel code, which would be hard without a unified mechanism. Therefore linear layouts were introduced as the generic underlying mechanism.
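As a quick reminder of the generic mechanism from the earlier articles, a linear layout is linear over GF(2): the image of any input index is the XOR of the basis vectors corresponding to its set bits. A minimal sketch follows; the function name and basis values are illustrative, not Triton's actual API:

```python
def apply_linear_layout(bases, x):
    """Map input index x through a GF(2)-linear layout.

    bases[i] is the output for the input with only bit i set;
    by linearity, any input maps to the XOR of its set bits' bases.
    """
    out = 0
    for i, basis in enumerate(bases):
        if (x >> i) & 1:
            out ^= basis
    return out

identity = [1, 2, 4]  # bit i -> bit i: the identity layout
swap_low = [2, 1, 4]  # swaps the two low bits
print(apply_linear_layout(identity, 5))  # 5
print(apply_linear_layout(swap_low, 5))  # bits 0 and 2 set -> 2 ^ 4 = 6
```

Because every layout reduces to the same algebraic form, conversions and shape manipulations can be composed and simplified uniformly instead of case by case.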

Genericity comes at the cost of a higher cognitive burden, though, as human minds typically prefer vivid illustrations to terse theories. Especially now with Gluon, developers can program layouts directly to get precise control and overrule inefficiencies of compiler heuristics. So even as Triton compiler internals transition to rely heavily on linear layouts for generic optimizations, bespoke layouts remain great complementary mechanisms. It's like knowing that all high-level programming languages are eventually translated into assembly, yet still preferring to program in the former.
