
Show HN: How LLMs Work – Interactive visual guide based on Karpathy's lecture

Why This Matters

This interactive visual guide demystifies how large language models (LLMs) are trained, highlighting the importance of vast, high-quality data in shaping their capabilities. Understanding this process is crucial for both developers and consumers as it influences the performance and reliability of AI applications across industries.

Key Takeaways

Chapter 1 · Pre-Training · Stage 1

Downloading the Internet

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.

The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes, roughly what fits on a single hard drive, representing ~15 trillion tokens.
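The figures above imply a specific compression ratio between raw text and tokens. A quick back-of-envelope check (using the approximate numbers from the text, not exact dataset statistics):

```python
# Sanity check on the FineWeb-scale figures quoted above.
# Both values are approximations taken from the text.
dataset_bytes = 44e12   # ~44 terabytes of filtered text
token_count = 15e12     # ~15 trillion tokens

bytes_per_token = dataset_bytes / token_count
print(f"~{bytes_per_token:.1f} bytes per token")
```

This works out to roughly 3 bytes per token, which is consistent with typical subword tokenizers, where an average token covers a few characters of English text.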

Key Insight The quality and diversity of this training data has more impact on the final model than almost anything else. Garbage in, garbage out — but at a trillion-token scale.