TScale – distributed training on consumer GPUs

Published on: 2025-07-27 02:29:55

TScale

This repo contains transformer training and inference code written in C++ and CUDA. TScale is designed to run on consumer hardware. To achieve the best results it features:

- An optimized transformer architecture with faster convergence and ~2x reduced attention costs
- Support for fp8 and int8 precision for model weights and activations (a generic int8 sketch appears after this list)
- Optimizations for consumer NVIDIA GPUs, including fast reduced-precision training without sacrificing model quality
- CPU offload, which reduces GPU memory requirements for training
- Synchronous distributed training on several same-config hosts
- 1-bit gradient compression, which allows regular Ethernet links to be used for interconnect (a generic sketch also appears below)
- Asynchronous distributed training on arbitrary hosts with negligible network traffic; in this mode, training can run on geographically separated hosts

Distributed training of a 1.5B model on consumer GPUs

By using inexpensive GPUs and async distributed mode, TScale trains LLMs fast and affordably. Log loss for the 1.5B model trained on fineweb-edu for 2 days and $...
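The fp8/int8 support listed above refers to storing weights and activations at reduced precision. As a rough illustration of the underlying idea only (not TScale's actual code; `QuantTensor` and `QuantizeInt8` are hypothetical names), symmetric per-tensor int8 quantization can be sketched like this:

```cpp
// Hypothetical sketch of symmetric per-tensor int8 quantization,
// not TScale's implementation. Requires C++17 for std::clamp.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantTensor {
    std::vector<int8_t> data;
    float scale; // dequantize with: w ~= data[i] * scale
};

QuantTensor QuantizeInt8(const std::vector<float>& w) {
    float maxAbs = 0.f;
    for (float x : w) maxAbs = std::max(maxAbs, std::fabs(x));
    QuantTensor q;
    // Map the largest magnitude to 127; degenerate all-zero tensor keeps scale 1.
    q.scale = maxAbs > 0.f ? maxAbs / 127.f : 1.f;
    q.data.reserve(w.size());
    for (float x : w) {
        float v = std::clamp(x / q.scale, -127.f, 127.f);
        q.data.push_back(static_cast<int8_t>(std::lround(v)));
    }
    return q;
}
```

Storing weights this way cuts memory traffic to a quarter of fp32, which is one reason reduced precision pays off on consumer GPUs.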
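The 1-bit gradient compression feature is in the same spirit as sign-based compression with error feedback: each worker transmits only the sign of every gradient component plus a single scale, and keeps the quantization residual locally so it is re-added before the next step. A minimal sketch, assuming a sign + mean-absolute-value scheme (`CompressGrad`, `DecompressGrad`, and the struct layout are illustrative, not TScale's API):

```cpp
// Hypothetical sketch of 1-bit gradient compression with error feedback,
// not TScale's implementation.
#include <cmath>
#include <cstdint>
#include <vector>

struct Compressed {
    std::vector<uint32_t> signBits; // 1 bit per gradient element
    float scale;                    // mean absolute value, sent alongside
};

// grad and errFeedback must be the same size; errFeedback persists across steps.
Compressed CompressGrad(std::vector<float>& grad, std::vector<float>& errFeedback) {
    const size_t n = grad.size();
    Compressed out;
    out.signBits.assign((n + 31) / 32, 0);
    double sumAbs = 0.0;
    for (size_t i = 0; i < n; ++i) {
        grad[i] += errFeedback[i]; // re-inject the residual lost last step
        sumAbs += std::fabs(grad[i]);
    }
    out.scale = n ? static_cast<float>(sumAbs / n) : 0.f;
    for (size_t i = 0; i < n; ++i) {
        float q = grad[i] >= 0.f ? out.scale : -out.scale;
        if (grad[i] >= 0.f) out.signBits[i / 32] |= 1u << (i % 32);
        errFeedback[i] = grad[i] - q; // remember what the 1-bit code lost
    }
    return out;
}

void DecompressGrad(const Compressed& c, std::vector<float>& grad) {
    for (size_t i = 0; i < grad.size(); ++i) {
        bool pos = (c.signBits[i / 32] >> (i % 32)) & 1u;
        grad[i] = pos ? c.scale : -c.scale;
    }
}
```

Each link then carries roughly 1 bit per parameter per step instead of 32, which is why commodity Ethernet can serve as the interconnect.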