A loss plateau that looked like my mistake turned out to be a PyTorch bug. Tracking it down meant peeling back every layer of abstraction, from optimizer internals to GPU kernels.
Expected to fix: my hyperparameters. Actually had to fix: the PyTorch backend.
My training loss plateaued and wouldn’t budge. Obviously I’d screwed something up. I tried every hyperparameter combination, rewrote my loss function, spent days assuming I’d made some stupid mistake. Because it’s always user error.
This time, it wasn’t. It was a niche PyTorch bug that forced me through layers of abstraction I normally never think about: optimizer internals, memory layouts, dispatch systems, kernel implementations. It taught me more about the framework than years of using it had.
I had a surprisingly fun time with this bug hunt and wrote up the whole investigation step-by-step, explaining framework internals as they become necessary to crack the case. If you enjoy debugging mysteries or find that tracking down bugs teaches you more than docs ever could, this might resonate. 🕵️♀️
Debugging post-mortems sometimes make me worry I wouldn’t have been smart enough to figure them out myself. So I structured this walkthrough to show the reasoning behind each step: what clues suggested each move, why I tested each hypothesis, and why certain results pointed where they did. While the investigation took time and persistence, it didn’t require any particular expertise or wizardry, just observation and a willingness to keep digging. I’ve included background knowledge exactly when you need it to understand the next step; think of it as an excuse to learn (or re-learn) PyTorch internals through a real problem. If you’d prefer to jump straight to reproducing the bug yourself, check out the minimal reproduction script and walkthrough on GitHub. Otherwise, join me on the investigation!
Table of Contents:
🤔 The Mystery: A Plateauing Loss
🔎 Isolating the Problem
💻 Device-Specific Differences
⌺ Tensor Memory Layouts
💔 Identifying the Broken Operations
🍎 Inside the Kernel Implementation
🕵️‍♀️ Case Closed
TL;DR - Just tell me the bug

The Bug: A PyTorch GPU kernel silently failed when writing to non-contiguous memory, causing my model’s encoder weights to freeze during training on Apple Silicon (MPS backend, PyTorch < 2.4).

The Technical Details: PyTorch’s MPS (Apple Silicon GPU) backend had a kernel bug where the addcmul_ and addcdiv_ operations silently fail when writing to non-contiguous output tensors.

Why It Caused the Training Plateau (see the sketch below for these ingredients in isolation):
- Encoder weights were initialized as the transpose of the decoder’s → non-contiguous memory layout
- Adam’s state tensors inherited this layout (exp_avg and exp_avg_sq became non-contiguous)
- MPS kernels for addcmul_ / addcdiv_ don’t handle non-contiguous outputs correctly → the optimizer’s updates were silently dropped, so the encoder weights froze and the loss plateaued
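To make that chain concrete, here is a minimal sketch of the three ingredients, not the reproduction script linked above. It assumes PyTorch < 2.4 on an Apple Silicon machine (falling back to CPU elsewhere), and the tensor names (decoder_weight, exp_avg_sq, grad) are illustrative rather than taken from my model.

```python
# Sketch of the failure mode described above, NOT the GitHub reproduction script.
# Assumes PyTorch < 2.4 on Apple Silicon; on a patched version, or on CPU/CUDA,
# the final check passes.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

# 1. Tying the encoder weight to the transpose of the decoder weight gives it
#    a non-contiguous memory layout.
decoder_weight = torch.randn(4, 8, device=device)
encoder_weight = decoder_weight.t()                 # same storage, swapped strides
print(encoder_weight.is_contiguous())               # False

# 2. Adam allocates its state with zeros_like(..., memory_format=preserve_format),
#    so exp_avg / exp_avg_sq inherit that non-contiguous layout.
exp_avg_sq = torch.zeros_like(encoder_weight, memory_format=torch.preserve_format)
print(exp_avg_sq.is_contiguous())                   # False

# 3. The second-moment update writes in place via addcmul_. On the buggy MPS
#    kernels, this write into a non-contiguous output is silently dropped,
#    so the optimizer state (and with it the weights) never changes.
grad = torch.ones_like(encoder_weight)
exp_avg_sq.addcmul_(grad, grad, value=0.1)          # should fill with 0.1

expected = torch.full_like(exp_avg_sq.cpu(), 0.1)
print(torch.allclose(exp_avg_sq.cpu(), expected))   # False on the affected setups
```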