Chapter 1 · Pre-Training · Stage 1
Downloading the Internet
The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.
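To make the filtering step concrete, here is a minimal sketch of the kind of cheap page-level heuristics such a pipeline might apply. The helper name and thresholds are illustrative assumptions, not FineWeb's actual rules; the real pipeline also does language detection, deduplication, and much more.

def looks_like_quality_text(doc: str) -> bool:
    # Cheap heuristics: reject very short pages, pages that are mostly
    # symbols or markup, and pages that just repeat the same few words.
    words = doc.split()
    if len(words) < 50:                                    # too short to be a useful document
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    if alpha_ratio < 0.8:                                  # likely spam, markup, or symbol dumps
        return False
    if len(set(words)) / len(words) < 0.3:                 # repetitive boilerplate (menus, listings)
        return False
    return True

pages = [
    "Buy now!!! Click here >>> $$$ " * 40,                 # spammy, symbol-heavy page
    "Common Crawl has archived web pages since 2007, and each monthly "
    "snapshot contains billions of documents in dozens of languages. Before "
    "a page is useful for training, pipelines typically strip navigation "
    "menus, drop near-duplicates, detect the language, and score the "
    "remaining prose with simple heuristics so that only readable, "
    "informative text survives for the final dataset.",
]
kept = [p for p in pages if looks_like_quality_text(p)]
print(f"kept {len(kept)} of {len(pages)} pages")           # -> kept 1 of 2 pages

Real pipelines chain dozens of such filters, each cheap enough to run over billions of pages.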
The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes of text, roughly what fits on a single hard drive, representing ~15 trillion tokens.
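You can get a feel for that scale without downloading anything. The sketch below assumes FineWeb is published on the Hugging Face Hub as "HuggingFaceFW/fineweb" with a small "sample-10BT" configuration; streaming lets you iterate over documents without pulling the full corpus to disk.

from datasets import load_dataset

# Assumed Hub repo id and sample config; streaming avoids a full download.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

chars = 0
for i, row in enumerate(ds):
    chars += len(row["text"])          # each row is one filtered web document
    if i == 999:                       # peek at the first 1,000 documents only
        break

# Rule of thumb: roughly 4 characters per token for English web text,
# which is how ~44 TB of text lands in the same ballpark as ~15T tokens.
print(f"~{chars / 4:,.0f} tokens across the first 1,000 documents")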
Key Insight: The quality and diversity of this training data have more impact on the final model than almost anything else. Garbage in, garbage out, but at a trillion-token scale.