Fault Tolerant Llama training – PyTorch blog
Collaborators: Less Wright, Howard Huang, Chien-Chin Huang, Crusoe: Martin Cala, Ethan Petersen

tl;dr: We used torchft and torchtitan to train a model in a real-world environment with extreme synthetic failure rates to prove the reliability and correctness of fault-tolerant training.

Figure: Training loss across 1200 failures with no checkpoints. NOTE: Each small spike is a non-participating worker recovering, which affects the metrics but not the model.

Introduction

We want to demonstrate torchft in wo