How to Make LLM Training Faster with Unsloth and NVIDIA May 6, 2026
Authors: Daniel, Michael, Mathew and Datta, with help from NVIDIA
We collaborated with NVIDIA to make LLM training ~25% faster, and in this blog/guide we'll break down exactly how we did it. These optimizations have no loss in accuracy and are an extra addition on top of Unsloth's existing 2-5x speedup! The new algorithms are auto-enabled on RTX laptops, data center GPUs and DGX Spark machines, so just update Unsloth to get the latest improvements. By working with NVIDIA, we show how:
- Caching packed sequence metadata makes training 14.3% faster.
- Using double buffered async gradient checkpointing gives an 8% speedup.
- gpt-oss training is 15% faster by using argsort and bincount during MoE routing.
1. Caching Packed-Sequence Metadata

Suppose we have several short examples:
Instead of padding all of them to the same length and wasting compute on padding tokens, we concatenate them into one longer packed sequence:
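As a rough illustration (not the blog's actual data or code), here is how packing compares to padding for a few short tokenized examples; the token IDs are made up:

```python
# A minimal sketch: packing several short tokenized examples into one
# sequence instead of padding each of them to the longest length.
import torch

examples = [
    torch.tensor([101, 7592, 2088, 102]),        # 4 tokens
    torch.tensor([101, 2129, 2024, 2017, 102]),  # 5 tokens
    torch.tensor([101, 3835, 102]),              # 3 tokens
]

# Padding approach: every example is stretched to the longest length (5),
# so 15 slots are spent on 12 real tokens.
padded = torch.nn.utils.rnn.pad_sequence(examples, batch_first=True)
print(padded.shape)  # torch.Size([3, 5])

# Packing approach: concatenate into one 12-token sequence, no padding.
packed = torch.cat(examples)
print(packed.shape)  # torch.Size([12])
```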
The model still needs to know where each original sequence starts and ends. So, alongside the packed tokens, we carry sequence metadata such as:
- sequence lengths
- cumulative sequence offsets (cu_seqlens)
- the maximum sequence length
- attention structure derived from the three items above

This is the key point: for a fixed packed batch, that metadata is the same for every layer. If we write the boundary information for a packed batch as:

B = { lengths, cu_seqlens, max_seqlen, mask structure }

then every transformer layer in that forward pass consumes the same B. If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again. In other words, the useful work is: build B once, use it L times. The wasteful version is: build B + build B + ⋯ + build B (L times).

The overhead here is not primarily extra FLOPs. Some of these paths can force device-to-host synchronization, effectively creating a GPU-CPU sync point. Once that happens inside a per-layer path, the overhead recurs at every layer.

That is what the packed-sequence caching change reduces. Instead of repeatedly reconstructing packed sequence info, SDPA packed masks, and xFormers block masks, it caches the reusable metadata and the attention-side structures derived from it, per device, for the current packed batch. Those cached structures are then reused across layers, as in the sketch below.

Why this helps

Packed training already improves utilization by eliminating padding waste. But if the metadata path keeps forcing synchronization, some of that gain is lost to overhead that has nothing to do with the model's actual learning. Caching helps because it removes repeated coordination work from the hot path. The forward pass benefits the most because that is where the same packed metadata is consumed repeatedly across many layers.
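Here is a minimal sketch of the caching idea in a PyTorch-style setup. The names `PackedMetadata`, `build_packed_metadata`, `get_packed_metadata`, and the cache-keying scheme are illustrative assumptions, not Unsloth's actual implementation; the point is simply that the metadata (and anything derived from it) is built once per packed batch and reused by every layer.

```python
# Illustrative sketch only -- not Unsloth's code. Builds packed-sequence
# metadata once per batch, caches it per device, and reuses it across layers.
from dataclasses import dataclass
import torch

@dataclass
class PackedMetadata:
    seqlens: torch.Tensor     # length of each original sequence
    cu_seqlens: torch.Tensor  # cumulative offsets, shape (num_seqs + 1,)
    max_seqlen: int           # longest sequence in the packed batch

# In practice this cache would be cleared whenever a new packed batch arrives.
_metadata_cache: dict = {}

def build_packed_metadata(seqlens: torch.Tensor) -> PackedMetadata:
    # cu_seqlens = [0, l0, l0+l1, ...] marks where each sequence starts/ends.
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0), (1, 0))
    # .item() forces a device-to-host sync: cheap once per batch,
    # costly if repeated inside every layer's forward pass.
    max_seqlen = int(seqlens.max().item())
    return PackedMetadata(seqlens, cu_seqlens.to(torch.int32), max_seqlen)

def get_packed_metadata(seqlens: torch.Tensor) -> PackedMetadata:
    # Key by device and buffer identity to stand in for "current packed batch".
    key = (seqlens.device, seqlens.data_ptr())
    if key not in _metadata_cache:
        _metadata_cache[key] = build_packed_metadata(seqlens)
    return _metadata_cache[key]

# Every layer asks for the metadata; only the first call actually builds it.
seqlens = torch.tensor([4, 5, 3])
for _ in range(32):  # e.g. 32 transformer layers
    meta = get_packed_metadata(seqlens)
print(meta.cu_seqlens)  # tensor([ 0,  4,  9, 12], dtype=torch.int32)
```

The `.item()` call above is the kind of device-to-host sync point the blog describes: with caching it happens once per packed batch rather than once per layer.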
Benchmarks

On Qwen3-14B QLoRA SFT: forward: +43.3%