The TensorPool Agent is currently in beta. We’d love your feedback!
The TensorPool Agent is an autonomous monitoring and recovery system for long-running distributed training jobs on Kubernetes, Slurm, or TensorPool Jobs. It’s designed for large multi-node runs that last days to weeks.
When the TensorPool Agent detects a runtime error, it attempts to autonomously recover your training job from its last checkpoint. You explicitly whitelist the actions the TensorPool Agent can take on your behalf.
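Recovery from the last checkpoint assumes your training loop writes resumable checkpoints in the first place. Below is a minimal PyTorch sketch of atomic checkpoint save/resume; the path, state layout, and function names are illustrative, not a TensorPool convention:

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"  # illustrative path, not a TensorPool convention

def save_checkpoint(model, optimizer, step):
    # Write to a temp file, then rename atomically, so a crash mid-save
    # can never corrupt the latest checkpoint.
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    tmp = CKPT_PATH + ".tmp"
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        tmp,
    )
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1
```

With checkpoints written this way, a restarted job picks up at the step returned by load_checkpoint rather than from scratch.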
Best case: The TensorPool Agent recovers your training job when you are AFK, letting you get more iteration cycles and avoid burning GPU hours.
Worst case: The TensorPool Agent delivers a preliminary root cause analysis together with the actions it would have taken.
Target Failures
The TensorPool Agent is designed to address runtime errors that occur deep into training (a detection sketch follows this list):
GPU hardware faults: Xid errors (79, 63, 48, etc.)
Distributed communication failures, NCCL errors
Infrastructure problems: host hardware failures, kernel panics
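As an illustration of the first failure class, NVIDIA driver Xid events surface in the kernel log and can be scanned for with a few lines of Python. This is a minimal sketch of how such faults show up, not TensorPool's implementation; it assumes permission to run dmesg on the node:

```python
import re
import subprocess

# Matches NVIDIA driver log lines such as:
#   NVRM: Xid (PCI:0000:65:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(.*?\): (\d+)")

def scan_for_xid_errors():
    # Read the kernel ring buffer; requires dmesg access on the node.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return sorted({int(m.group(1)) for m in XID_PATTERN.finditer(log)})

if __name__ == "__main__":
    codes = scan_for_xid_errors()
    if codes:
        print(f"GPU Xid errors detected: {codes}")  # e.g. [48, 63, 79]
```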