This rig exists to train models, not serve them. Four RTX PRO 6000 Blackwell cards in one chassis at 600 W each means 2.4 kW of heat to evict, and training runs last hours to days with every card pinned at full TDP. Air coolers can handle that for an inference burst, but not for a multi-day training job: the fans get loud, each card breathes its neighbor's exhaust, and the first card to thermal-throttle stalls the whole synchronous step.
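To put "2.4 kW for days" in concrete terms, here is a back-of-envelope sketch. It assumes every card sits at its 600 W rating for the whole run and that essentially all of that electrical power ends up as heat. The function names are illustrative only.

```python
# Back-of-envelope heat budget for the four-card rig.
# From the post: 4 cards at 600 W each, runs lasting hours to days.
# Assumption: all electrical power drawn by the cards ends up as heat.

def gpu_heat_w(num_cards: int, tdp_w: float) -> float:
    """Sustained heat to evict when every card runs at full TDP."""
    return num_cards * tdp_w

def energy_kwh(power_w: float, hours: float) -> float:
    """Energy dissipated over a run of the given length, in kWh."""
    return power_w * hours / 1000.0

heat = gpu_heat_w(4, 600)        # 2400 W sustained
per_day = energy_kwh(heat, 24)   # 57.6 kWh of heat per day of training
print(f"{heat:.0f} W sustained, {per_day:.1f} kWh per 24 h")
```

A three-day run is three times that figure, all of which the cooling has to move out of the case.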
So we converted the cards to waterblocks. We did one card first as a pilot, ran it for about a week, and only then touched the other three. That sequencing matters; it's why we have a story to tell. The pilot card failed, and the lesson from that failure is why the other three went on without incident.
This post is the short version: what we did, what broke, what we learned, and where we landed.
The rig
4× RTX PRO 6000 Blackwell Workstation (GB202, 96 GB GDDR7, 600 W)
Threadripper Pro 7995WX on WRX90
4× Bykski waterblocks (full-cover, GPU + VRM + memory front-side)
Custom loop: single distro/reservoir, two pumps, distilled water, two Alphacool NexXxoS XT45 Full Copper 1260 mm Super Nova radiators (9× 140 mm fans each), four GPUs plumbed in parallel
2× 1500 W PSUs (3 kW total budget) to feed ~2.4 kW of sustained GPU draw plus the CPU and the loop; the AC circuit was upgraded mid-build after an earlier incident in which every card went down under load
That’s one radiator. There are two of them.
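A rough sizing check ties the spec list together: how much PSU headroom is left, and how much the coolant warms up crossing the GPU blocks. Only the GPU count, the 600 W rating, the two 1500 W PSUs, and the parallel plumbing come from the post. The 350 W CPU figure is the 7995WX's rated TDP; the 150 W for pumps, fans, and board, and the 4 L/min total flow rate, are hypothetical placeholders, not measurements from this build. The coolant rise uses ΔT = P / (ṁ · c_p).

```python
# Hypothetical sizing check for the PSUs and the loop.
# From the post: 4 x 600 W GPUs, 2 x 1500 W PSUs, GPUs plumbed in parallel.
# ASSUMED (not from the post): 150 W for pumps, fans and board,
# and 4 L/min total loop flow split evenly across the four blocks.

WATER_CP_J_PER_KG_K = 4186.0  # specific heat of water near room temperature
WATER_KG_PER_L = 1.0          # close enough for a sizing estimate

def psu_headroom_w(psu_w: list[float], loads_w: list[float]) -> float:
    """Rated PSU output minus the summed sustained loads."""
    return sum(psu_w) - sum(loads_w)

def coolant_delta_t_k(heat_w: float, flow_l_per_min: float) -> float:
    """Water temperature rise when absorbing heat_w: dT = P / (m_dot * c_p)."""
    m_dot = flow_l_per_min * WATER_KG_PER_L / 60.0  # kg/s
    return heat_w / (m_dot * WATER_CP_J_PER_KG_K)

gpus = [600.0] * 4
cpu_w = 350.0    # 7995WX rated TDP
other_w = 150.0  # ASSUMED: pumps, fans, board
headroom = psu_headroom_w([1500.0, 1500.0], gpus + [cpu_w, other_w])

# With an even parallel split, each block gets 1/4 of the flow and 1/4 of
# the heat, so each card sees the same rise as the GPU section as a whole.
dt_loop = coolant_delta_t_k(sum(gpus), 4.0)
dt_card = coolant_delta_t_k(600.0, 4.0 / 4)
print(f"PSU headroom: {headroom:.0f} W, "
      f"coolant rise: {dt_loop:.1f} K (per card {dt_card:.1f} K)")
```

With these placeholder numbers the margin is about 100 W and the water warms roughly 8.6 K across the blocks. Swapping in measured draw and a measured flow rate is the point of keeping the check this simple.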