TScale – distributed training on consumer GPUs
Published on: 2025-07-27 02:29:55
TScale
This repo contains transformer training and inference code written in C++ and CUDA.
TScale is designed to run on consumer hardware. To achieve the best results, it features:
Optimized transformer architecture with faster convergence and ~2x reduced attention costs
Support for fp8 and int8 precision for model weights and activations (sketched after this list)
Optimized for consumer NVIDIA GPUs, including fast reduced-precision training without sacrificing model quality
CPU offload to reduce GPU memory requirements during training
Synchronous distributed training across several identically configured hosts
1-bit gradient compression, which allows regular Ethernet links to serve as the interconnect (also sketched after this list)
Asynchronous distributed training on arbitrary hosts with negligible network traffic. In this mode, training can run on geographically separated hosts
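The fp8/int8 support above relies on reduced-precision number formats for weights and activations. As a rough illustration only, the sketch below shows symmetric per-tensor int8 quantization; the names (`QuantizedTensor`, `QuantizeInt8`) and the per-tensor scaling scheme are assumptions for this example, not TScale's actual code.

```cpp
// Minimal sketch of symmetric per-tensor int8 quantization, the kind of
// scheme reduced-precision weights/activations rely on. Names are
// illustrative assumptions, not TScale's API.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale;  // real value ~= data[i] * scale
};

QuantizedTensor QuantizeInt8(const std::vector<float>& x) {
    float maxAbs = 0.f;
    for (float v : x) maxAbs = std::max(maxAbs, std::fabs(v));
    QuantizedTensor q;
    q.scale = maxAbs > 0 ? maxAbs / 127.f : 1.f;
    q.data.reserve(x.size());
    for (float v : x)
        q.data.push_back(int8_t(std::lrintf(std::clamp(v / q.scale, -127.f, 127.f))));
    return q;
}

float Dequantize(const QuantizedTensor& q, size_t i) {
    return q.data[i] * q.scale;
}
```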
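The 1-bit gradient compression item refers to the general technique of sending only the sign of each gradient element plus a single scale, with an error-feedback buffer carrying the quantization error into the next step; cutting each gradient value to one bit is what makes ordinary Ethernet bandwidth sufficient for synchronous training. The sketch below is a minimal CPU-side illustration of that technique; the structure and names are assumptions for illustration, not TScale's implementation.

```cpp
// Minimal sketch of 1-bit (sign) gradient compression with error feedback.
// Illustrates the general technique only; details are assumptions, not
// TScale's actual code.
#include <cmath>
#include <cstdint>
#include <vector>

// Compress a gradient to one bit per element (its sign) plus a single
// scale, carrying the quantization error forward to the next step.
struct OneBitCompressor {
    std::vector<float> residual;  // error-feedback buffer

    // Returns packed sign bits; 'scale' receives the mean magnitude so the
    // receiver can reconstruct sign * scale.
    std::vector<uint8_t> Compress(const std::vector<float>& grad, float* scale) {
        if (residual.empty()) residual.assign(grad.size(), 0.f);
        std::vector<float> corrected(grad.size());
        double sumAbs = 0;
        for (size_t i = 0; i < grad.size(); ++i) {
            corrected[i] = grad[i] + residual[i];
            sumAbs += std::fabs(corrected[i]);
        }
        *scale = grad.empty() ? 0.f : float(sumAbs / grad.size());
        std::vector<uint8_t> bits((grad.size() + 7) / 8, 0);
        for (size_t i = 0; i < grad.size(); ++i) {
            float q = corrected[i] >= 0 ? *scale : -*scale;
            if (corrected[i] >= 0) bits[i / 8] |= uint8_t(1u << (i % 8));
            residual[i] = corrected[i] - q;  // keep what the 1-bit code lost
        }
        return bits;
    }
};

// Receiver side: expand the packed sign bits back into a dense gradient.
std::vector<float> Decompress(const std::vector<uint8_t>& bits, size_t n, float scale) {
    std::vector<float> grad(n);
    for (size_t i = 0; i < n; ++i)
        grad[i] = ((bits[i / 8] >> (i % 8)) & 1) ? scale : -scale;
    return grad;
}
```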
Distributed training of a 1.5B model on consumer GPUs
By using inexpensive GPUs and the async distributed mode, TScale trains LLMs quickly and affordably. Log loss for the 1.5B model trained on fineweb-edu for 2 days and $