Chapter 1 · Pre-Training · Stage 1
Downloading the Internet
The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.
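To make the filtering step concrete, here is a minimal sketch of the kind of cheap page-level heuristics such a pipeline might apply. The helper name and thresholds are illustrative assumptions, not FineWeb's actual rules; the real pipeline also does language detection, deduplication, and much more.

def looks_like_quality_text(doc: str) -> bool:
    # Cheap heuristics: reject very short pages, pages that are mostly
    # symbols or markup, and pages that just repeat the same few words.
    words = doc.split()
    if len(words) < 50:                                    # too short to be a useful document
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    if alpha_ratio < 0.8:                                  # likely spam, markup, or symbol dumps
        return False
    if len(set(words)) / len(words) < 0.3:                 # repetitive boilerplate (menus, listings)
        return False
    return True

pages = [
    "Buy now!!! Click here >>> $$$ " * 40,                 # spammy, symbol-heavy page
    "Common Crawl has archived web pages since 2007, and each monthly "
    "snapshot contains billions of documents in dozens of languages. Before "
    "a page is useful for training, pipelines typically strip navigation "
    "menus, drop near-duplicates, detect the language, and score the "
    "remaining prose with simple heuristics so that only readable, "
    "informative text survives for the final dataset.",
]
kept = [p for p in pages if looks_like_quality_text(p)]
print(f"kept {len(kept)} of {len(pages)} pages")           # -> kept 1 of 2 pages

Real pipelines chain dozens of such filters, each cheap enough to run over billions of pages.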
The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes of text, roughly what fits on a single hard drive, representing ~15 trillion tokens.
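You can get a feel for that scale without downloading anything. The sketch below assumes FineWeb is published on the Hugging Face Hub as "HuggingFaceFW/fineweb" with a small "sample-10BT" configuration; streaming lets you iterate over documents without pulling the full corpus to disk.

from datasets import load_dataset

# Assumed Hub repo id and sample config; streaming avoids a full download.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

chars = 0
for i, row in enumerate(ds):
    chars += len(row["text"])          # each row is one filtered web document
    if i == 999:                       # peek at the first 1,000 documents only
        break

# Rule of thumb: roughly 4 characters per token for English web text,
# which is how ~44 TB of text lands in the same ballpark as ~15T tokens.
print(f"~{chars / 4:,.0f} tokens across the first 1,000 documents")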
Key Insight: The quality and diversity of this training data have more impact on the final model than almost anything else. Garbage in, garbage out, but at a trillion-token scale.