
Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks


Custom CUDA kernels that eliminate the computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.

The Problem: Equivariant GNNs Are Beautiful but Slow

Equivariant graph neural networks have revolutionized atomistic machine learning. Models like MACE, NequIP, and Allegro achieve state-of-the-art accuracy in molecular dynamics simulations, materials property prediction, and drug discovery. Their secret: they respect the fundamental symmetries of physical systems - rotation, translation, and reflection invariance.

But this mathematical elegance comes at a computational cost. The operations that make these models work - spherical harmonics and Clebsch-Gordan tensor products - are expensive. A single MACE layer can spend 80% of its forward pass time in these two operations.
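To make the cost concrete, here is a minimal NumPy sketch (not Batmobile's kernels, and simplified from the general Clebsch-Gordan machinery) of what a tensor product of two l=1 features decomposes into: an l=0 scalar, an l=1 vector, and an l=2 symmetric traceless tensor. Real models do this across many channels and many (l1, l2) pairs per edge, which is where the time goes.

```python
import numpy as np

def cg_tensor_product_l1(a, b):
    """Decompose the tensor product of two l=1 (3-vector) features
    into its irreducible l=0, l=1, and l=2 parts."""
    # l=0 output: the dot product (a rotation-invariant scalar)
    l0 = np.dot(a, b)
    # l=1 output: the cross product (transforms like a vector)
    l1 = np.cross(a, b)
    # l=2 output: the symmetric traceless part (5 independent components)
    sym = 0.5 * (np.outer(a, b) + np.outer(b, a))
    l2 = sym - (np.trace(sym) / 3.0) * np.eye(3)
    return l0, l1, l2

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
l0, l1, l2 = cg_tensor_product_l1(a, b)
print(l0, l1, np.trace(l2))
```

The three outputs together carry exactly the nine numbers of the raw outer product, but sorted by how they transform under rotation, which is what makes the network equivariant.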

This matters for real applications. Molecular dynamics simulations run billions of timesteps. Battery materials discovery screens millions of candidates. Drug binding affinity predictions evaluate thousands of poses. When each forward pass takes milliseconds instead of microseconds, these workflows become impractical.

Understanding the Bottleneck

To understand why equivariant GNNs are slow, we need to understand what they compute.

Spherical Harmonics: Encoding Directions

When two atoms interact, the direction of their bond matters. A carbon-carbon bond pointing "up" is physically different from one pointing "right" - and our neural network needs to know this.
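A minimal sketch of that encoding, assuming NumPy and the standard real spherical harmonics up to l=1 (the helper name is ours, not from any library): the l=0 component is a rotation-invariant constant, while the l=1 components are just the normalized direction itself, scaled.

```python
import numpy as np

def real_spherical_harmonics_l1(r):
    """Return [Y_00, Y_1,-1, Y_10, Y_11] for a 3-vector r
    (real spherical harmonics up to l=1, physics normalization)."""
    x, y, z = r / np.linalg.norm(r)      # only the direction matters
    c0 = 0.5 * np.sqrt(1.0 / np.pi)     # l=0: constant, rotation-invariant
    c1 = np.sqrt(3.0 / (4.0 * np.pi))   # l=1: transforms like a vector
    return np.array([c0, c1 * y, c1 * z, c1 * x])

bond = np.array([0.0, 0.0, 1.54])       # a C-C bond pointing "up" (angstroms)
print(real_spherical_harmonics_l1(bond))
```

Production models evaluate these up to l=2 or l=3 for every edge in the graph at every layer, which is why a fast fused kernel for this one function pays off.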
