Writing an LLM from scratch, part 10 – dropout
Published on: 2025-06-08 06:25:54
I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.
The concept is straightforward: you want knowledge to be spread broadly across your model, not concentrated in a few places. That way all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.
So, while you're training (but, importantly, not during inference) you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.
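That "randomly ignore things during training, use everything at inference" idea can be sketched by hand with a simple mask. This is just an illustrative sketch (the drop probability of 0.5 is an assumed value, and the scaling shown is the "inverted dropout" convention that PyTorch uses), not the book's code:

```python
import torch

torch.manual_seed(42)  # for reproducibility
p = 0.5  # drop probability (assumed value for illustration)
activations = torch.ones(8)

# Training: sample a random binary mask and zero out the "dropped" units.
# Scaling the survivors by 1/(1-p) -- "inverted dropout" -- keeps the
# expected activation the same, so nothing needs to change at inference.
mask = (torch.rand_like(activations) > p).float()
train_out = activations * mask / (1 - p)

# Inference: no masking at all -- the full network is used.
inference_out = activations
```

Because the survivors are scaled up during training, each surviving element here comes out as 2.0 rather than 1.0, while the dropped ones are 0.0.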
Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.
Code-wise, it's really easy: PyTorch provides a useful torch.nn.Dropout class that you create with the drop probability as its argument.
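A minimal sketch of torch.nn.Dropout in action (the tensor shape and p=0.5 are just assumed values for illustration):

```python
import torch

torch.manual_seed(0)  # for reproducibility
dropout = torch.nn.Dropout(p=0.5)  # each element is zeroed with probability 0.5
x = torch.ones(2, 4)

# Training mode: random elements are zeroed, survivors scaled by 1/(1-p).
dropout.train()
out_train = dropout(x)  # a mix of 0.0 and 2.0

# Inference mode: dropout becomes a no-op and passes the input through.
dropout.eval()
out_eval = dropout(x)  # identical to x
```

The train()/eval() switch is what makes dropout active during training but not during inference.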