I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear.
TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two consumer boxes talk fast enough to run tensor-parallel inference and FSDP workloads across both machines: ~95 Gb/s bidirectional raw RDMA, ~7 µs one-way latency, a MiniMax-M2.7 TP=2 inference run that does not fit on one box, and a Gemma 3 27B LoRA FSDP step falling from 1359 s over Ethernet to 126 s over 4-HCA USB4 RDMA.
~48 Gb/s per direction (~95 Gb/s bidi total) sustained ib_write_bw, 4-HCA aggregate at 1 MiB / 8 QPs with IOMMU off, versus ~2.3 Gb/s over the onboard 2.5 GbE and ~9 Gb/s for soft-RoCE on top of thunderbolt-net at the per-rail level.
~7 µs one-way ib_write_lat at 64 B, single QP, versus ~28 µs over RXE/2.5 GbE and ~65 µs over RXE/TBnet.
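To put those figures side by side, here is a minimal sketch that only does the arithmetic on the numbers quoted above. The dictionary keys and the `ratio` and `summarize` helpers are hypothetical names chosen for illustration. Every input is an approximate ("~") value from the text, so treat the ratios as rough. Note too that the bandwidth comparison mixes a 4-HCA aggregate with per-rail baselines, so it is not strictly like-for-like.

```python
# Approximate figures quoted in the text above (all "~" values in the source).
BANDWIDTH_GBPS = {
    "usb4_rdma_4hca_per_direction": 48.0,  # ib_write_bw, 1 MiB, 8 QPs, IOMMU off
    "onboard_2_5gbe": 2.3,                 # per-rail baseline
    "soft_roce_over_thunderbolt_net": 9.0, # per-rail baseline
}
LATENCY_US = {
    "usb4_rdma": 7.0,           # ib_write_lat, 64 B, single QP, one-way
    "rxe_over_2_5gbe": 28.0,
    "rxe_over_tbnet": 65.0,
}
FSDP_STEP_S = {
    "ethernet": 1359.0,         # Gemma 3 27B LoRA FSDP step
    "usb4_rdma_4hca": 126.0,
}


def ratio(larger: float, smaller: float) -> float:
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


def summarize() -> dict[str, float]:
    """Compute improvement factors of USB4 RDMA over each quoted baseline."""
    bw = BANDWIDTH_GBPS["usb4_rdma_4hca_per_direction"]
    lat = LATENCY_US["usb4_rdma"]
    return {
        "bandwidth_vs_2_5gbe": ratio(bw, BANDWIDTH_GBPS["onboard_2_5gbe"]),
        "bandwidth_vs_soft_roce_tbnet": ratio(
            bw, BANDWIDTH_GBPS["soft_roce_over_thunderbolt_net"]
        ),
        "latency_vs_rxe_2_5gbe": ratio(LATENCY_US["rxe_over_2_5gbe"], lat),
        "latency_vs_rxe_tbnet": ratio(LATENCY_US["rxe_over_tbnet"], lat),
        "fsdp_step_vs_ethernet": ratio(
            FSDP_STEP_S["ethernet"], FSDP_STEP_S["usb4_rdma_4hca"]
        ),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.1f}x")
```

On these inputs the FSDP step comes out at a bit under 11x faster than the Ethernet run, and the 64 B latency at about 4x lower than RXE over 2.5 GbE.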
DISCLAIMER: this is research code, most of it AI-generated, and it loads experimental kernel modules on machines I was willing to crash repeatedly. I made an effort to understand enough of it to keep it on track, but there are almost certainly false assumptions and sharp edges throughout. No warranty, no support promise, not production software.