We recently achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters), transferring weights from 256 training GPUs (BF16) to 128 inference GPUs (FP8).
In asynchronous reinforcement learning fine-tuning, training and inference run on separate GPUs. After each training step, new weights must be pushed to inference nodes. Many existing frameworks take several seconds—or even minutes—for trillion-parameter models.
By leveraging RDMA point-to-point communication, we make weight transfer blazing fast without modifying the inference engine, and the code becomes easier to write and maintain.
RDMA WRITE: one-sided transfers
Our solution is built on RDMA WRITE, a one-sided primitive where the source directly writes into the destination’s GPU memory.
def rdma_write(src_ptr, dst_ptr, size, src_mr, dst_mr): ...
The destination side is not even notified of the transfer. This gives us low-latency, high-throughput, zero-copy transfers driven entirely by the training nodes, with no control logic on the inference nodes.
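To make this concrete, here is a minimal Python sketch of a training-side transfer call. LocalRegion, RemoteRegion, and the endpoint methods post_write and poll_completion are hypothetical wrappers around an RDMA verbs library, not the actual API of our implementation.

from dataclasses import dataclass

@dataclass
class LocalRegion:
    addr: int   # address of our registered (GPU) buffer
    lkey: int   # local key returned when the region was registered

@dataclass
class RemoteRegion:
    addr: int   # address of the peer's registered (GPU) buffer
    rkey: int   # remote key the peer shared with us during setup

def push_shard(ep, src: LocalRegion, dst: RemoteRegion, nbytes: int) -> None:
    # One-sided RDMA WRITE: the NIC on the training node writes nbytes
    # from our registered buffer directly into the peer's GPU memory.
    ep.post_write(local_addr=src.addr, lkey=src.lkey,
                  remote_addr=dst.addr, rkey=dst.rkey,
                  length=nbytes)
    # Only the sender polls for completion; the receiver is never notified.
    ep.poll_completion()

Because a one-sided WRITE needs the remote address and rkey up front, the inference side only has to register its weight buffers once and share their locations and keys during the metadata collection step described below.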
High-level workflow
1. Metadata collection – The controller gathers parameter metadata from all training and inference GPUs.
2. Schedule computation – The controller computes a static weight transfer schedule, mapping which training GPU sends which parameter to which inference GPU, and in what order.
3. Schedule distribution – The controller sends the schedule to all training GPUs.
4. Execution – After each training step, the controller signals the training GPUs to start their transfers.
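As a rough illustration of the schedule computation step, the sketch below matches each inference-side parameter shard with the training GPU that owns it and groups the resulting transfer tasks by training rank. ShardMeta, TransferTask, and build_schedule are illustrative names, not our actual data structures.

from dataclasses import dataclass

@dataclass
class ShardMeta:
    rank: int     # GPU that holds (training) or expects (inference) the shard
    offset: int   # byte offset inside that GPU's registered weight buffer
    nbytes: int

@dataclass
class TransferTask:
    param_name: str
    dst_infer_rank: int
    src_offset: int
    dst_offset: int
    nbytes: int

def build_schedule(train_shards: dict, infer_shards: dict) -> dict:
    # For every shard an inference GPU needs, find the training GPU that
    # owns it and queue an ordered RDMA WRITE task on that training rank.
    schedule = {}
    for name, dst in infer_shards.items():
        src = train_shards[name]
        schedule.setdefault(src.rank, []).append(TransferTask(
            param_name=name,
            dst_infer_rank=dst.rank,
            src_offset=src.offset,
            dst_offset=dst.offset,
            nbytes=dst.nbytes,
        ))
    return schedule

A real schedule additionally has to account for resharding from 256 training GPUs to 128 inference GPUs and for the BF16-to-FP8 conversion, which this sketch omits.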
Weight transfer execution