We recently achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters), transferring weights from 256 training GPUs (BF16) to 128 inference GPUs (FP8).
In asynchronous reinforcement learning fine-tuning, training and inference run on separate GPUs. After each training step, new weights must be pushed to inference nodes. Many existing frameworks take several seconds—or even minutes—for trillion-parameter models.
By leveraging RDMA point-to-point communication, we make weight transfer blazing fast without modifying the inference engine, and the code becomes easier to write and maintain.
RDMA WRITE: one-sided transfers
Our solution is built on RDMA WRITE, a one-sided primitive where the source directly writes into the destination’s GPU memory.
def rdma_write(src_ptr, dst_ptr, size, src_mr, dst_mr): ...
The destination side is not even notified of the transfer. This gives us low-latency, high-throughput, zero-copy transfers driven entirely by the training nodes, with no control logic on the inference nodes.
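To make this concrete, here is a minimal Python sketch of a training-side transfer call. LocalRegion, RemoteRegion, and the endpoint methods post_write and poll_completion are hypothetical wrappers around an RDMA verbs library, not the actual API of our implementation.

from dataclasses import dataclass

@dataclass
class LocalRegion:
    addr: int   # address of our registered (GPU) buffer
    lkey: int   # local key returned when the region was registered

@dataclass
class RemoteRegion:
    addr: int   # address of the peer's registered (GPU) buffer
    rkey: int   # remote key the peer shared with us during setup

def push_shard(ep, src: LocalRegion, dst: RemoteRegion, nbytes: int) -> None:
    # One-sided RDMA WRITE: the NIC on the training node writes nbytes
    # from our registered buffer directly into the peer's GPU memory.
    ep.post_write(local_addr=src.addr, lkey=src.lkey,
                  remote_addr=dst.addr, rkey=dst.rkey,
                  length=nbytes)
    # Only the sender polls for completion; the receiver is never notified.
    ep.poll_completion()

Because a one-sided WRITE needs the remote address and rkey up front, the inference side only has to register its weight buffers once and share their locations and keys during the metadata collection step described below.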
High-level workflow
1. Metadata collection – The controller gathers parameter metadata from all training and inference GPUs.
2. Schedule computation – The controller computes a static weight transfer schedule, mapping which training GPU sends which parameter to which inference GPU, and in what order.
3. Schedule distribution – The controller sends the schedule to all training GPUs.
4. Execution – After each training step, the controller signals the training GPUs to start their transfers.
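As a rough illustration of the schedule computation step, the sketch below matches each inference-side parameter shard with the training GPU that owns it and groups the resulting transfer tasks by training rank. ShardMeta, TransferTask, and build_schedule are illustrative names, not our actual data structures.

from dataclasses import dataclass

@dataclass
class ShardMeta:
    rank: int     # GPU that holds (training) or expects (inference) the shard
    offset: int   # byte offset inside that GPU's registered weight buffer
    nbytes: int

@dataclass
class TransferTask:
    param_name: str
    dst_infer_rank: int
    src_offset: int
    dst_offset: int
    nbytes: int

def build_schedule(train_shards: dict, infer_shards: dict) -> dict:
    # For every shard an inference GPU needs, find the training GPU that
    # owns it and queue an ordered RDMA WRITE task on that training rank.
    schedule = {}
    for name, dst in infer_shards.items():
        src = train_shards[name]
        schedule.setdefault(src.rank, []).append(TransferTask(
            param_name=name,
            dst_infer_rank=dst.rank,
            src_offset=src.offset,
            dst_offset=dst.offset,
            nbytes=dst.nbytes,
        ))
    return schedule

A real schedule additionally has to account for resharding from 256 training GPUs to 128 inference GPUs and for the BF16-to-FP8 conversion, which this sketch omits.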
Weight transfer execution