Published on: 2025-07-12 21:06:02
Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb
Now that I've finished Chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" -- having worked my way through multi-head attention in the last post -- I thought it would be worth pausing to take stock before moving on to Chapter 4.
There are two things I want to cover: the "why" of self-attention, and some thoughts on context lengths. This post is on the "why" -- that is, why do the particular set of matrix multiplications described in the book do what we want them to do?
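For reference, here's a minimal sketch of the chain of matrix multiplications I mean, written in PyTorch with made-up toy dimensions and random weights rather than anything taken from the book; the causal mask and the multi-head machinery are left out to keep it short:

```python
import torch

# Toy dimensions (not from the book): 4 tokens, embedding size 8, head size 4.
torch.manual_seed(0)
num_tokens, d_in, d_out = 4, 8, 4

x = torch.randn(num_tokens, d_in)   # token embeddings for one sequence

# Trainable projection matrices (here just random, for illustration).
W_q = torch.randn(d_in, d_out)
W_k = torch.randn(d_in, d_out)
W_v = torch.randn(d_in, d_out)

queries = x @ W_q   # what each token is "looking for"
keys    = x @ W_k   # what each token offers to be matched against
values  = x @ W_v   # what each token contributes if attended to

scores  = queries @ keys.T                              # pairwise query/key dot products
weights = torch.softmax(scores / d_out ** 0.5, dim=-1)  # scaled, normalised per row
context = weights @ values                              # weighted mix of values per token

print(context.shape)  # torch.Size([4, 4])
```

The question for this post is why that particular sequence of projections, dot products, and weighted sums ends up doing anything useful at all.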
As always, this is something I'm doing primarily to get things clear in my own head -- with the possible extra benefit of it being of use to other people out there. I will, of course, run it past multiple LLMs to make sure I'm not posting total nonsense, but caveat lector!
Let's get into it. As I wrote in part 8 of this series:
I think it's also worth noting that [what's in the book is] very m