Decentralized Training Can Help Solve AI’s Energy Woes

Artificial intelligence harbors an enormous energy appetite. Such constant cravings are evident in the hefty carbon footprint of the data centers behind the AI boom and the steady increase over time of carbon emissions from training frontier AI models.

No wonder big tech companies are warming up to nuclear energy, envisioning a future fueled by reliable, carbon-free sources. But while nuclear-powered data centers might still be years away, some in the research and industry spheres are taking action right now to curb AI’s growing energy demands. They’re tackling training as one of the most energy-intensive phases in a model’s life cycle, focusing their efforts on decentralization.

Decentralization allocates model training across a network of independent nodes rather than relying on one platform or provider. It allows compute to go where the energy is—be it a dormant server sitting in a research lab or a computer in a solar-powered home. Instead of constructing more data centers that require electric grids to scale up their infrastructure and capacity, decentralization harnesses energy from existing sources, avoiding adding more power into the mix.

Hardware in harmony

Training AI models is a huge data center sport, synchronized across clusters of closely connected GPUs. But as hardware improvements struggle to keep up with the swift rise in size of large language models, even massive single data centers are no longer cutting it.

Tech firms are turning to the pooled power of multiple data centers—no matter their location. Nvidia, for instance, launched the Spectrum-XGS Ethernet for scale-across networking, which “can deliver the performance needed for large-scale single job AI training and inference across geographically separated data centers.” Similarly, Cisco introduced its 8223 router designed to “connect geographically dispersed AI clusters.”

Other companies are harvesting idle compute in servers, sparking the emergence of a GPU-as-a-Service business model. Take Akash Network, a peer-to-peer cloud computing marketplace that bills itself as the “Airbnb for data centers.” Those with unused or underused GPUs in offices and smaller data centers register as providers, while those in need of computing power are considered as tenants who can choose among providers and rent their GPUs.

“If you look at [AI] training today, it’s very dependent on the latest and greatest GPUs,” says Akash cofounder and CEO Greg Osuri. “The world is transitioning, fortunately, from only relying on large, high-density GPUs to now considering smaller GPUs.”

Software in sync

In addition to orchestrating the hardware, decentralized AI training also requires algorithmic changes on the software side. This is where federated learning, a form of distributed machine learning, comes in.

... continue reading