
How large are large language models? (2025)

This document aims to collect factual information about the size of large language models. None of it was written by AI, and I do not include any information from leaks or rumors. The focus is on base models (the raw text-continuation engines, not 'helpful chatbot/assistants'). It is a view, from a few years ago to today, of one very tiny fraction of the larger LLM story.

History

GPT-2, -medium, -large, -xl (2019): 137M, 380M, 812M, 1.61B parameters. Source: openai-community/gpt2. Trained on the unreleased WebText dataset, said to be 40GB of Internet text - I estimate that to be roughly 10B tokens (a back-of-envelope version of that estimate is sketched below). You can see a list of the websites that went into that dataset in domains.txt.

GPT-3 aka davinci, davinci-002 (2020): 175B parameters. There is a good breakdown of how those parameters are 'spent' in How does GPT-3 spend its 175B parameters? (a rough count is also sketched below). Trained on around 400B tokens composed of CommonCrawl, WebText2, Books1, Books2 and Wikipedia. Source: Language Models are Few-Shot Learners. These training runs required months on a data center full of tens of thousands of A100 GPUs (source).

GPT-3.5, GPT-4 (2022, 2023): No official factual information about their architecture or training data is available.
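To make the "40GB is roughly 10B tokens" estimate for WebText concrete, here is a minimal back-of-envelope sketch. The ratio of about 4 bytes of English web text per token under a GPT-2-style BPE tokenizer is my assumption, not a figure published by OpenAI.

```python
# Rough tokens-from-bytes estimate for WebText ("40GB of Internet text").
# Assumption: ~4 bytes of English text per BPE token (not an official figure).
dataset_bytes = 40e9
bytes_per_token = 4

estimated_tokens = dataset_bytes / bytes_per_token
print(f"~{estimated_tokens / 1e9:.0f}B tokens")  # ~10B
```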
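In the same spirit as the linked breakdown, here is a rough count of where GPT-3's 175B parameters go. The hyperparameters (96 layers, d_model = 12288, a 50257-token vocabulary, 2048-token context) are from the GPT-3 paper; biases and layer norms are ignored, so the total is approximate.

```python
# Approximate GPT-3 parameter count from its published hyperparameters.
n_layers = 96
d_model = 12288
vocab_size = 50257
context_len = 2048

attention = 4 * d_model**2           # Q, K, V and output projection matrices
mlp = 2 * d_model * (4 * d_model)    # two linear layers with a 4x hidden width
per_layer = attention + mlp          # = 12 * d_model**2

embeddings = vocab_size * d_model + context_len * d_model

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # ~174.6B, close to the quoted 175B
```

The shorthand this encodes is that a GPT-style dense transformer has roughly 12 * n_layers * d_model^2 non-embedding parameters, so nearly all of GPT-3's capacity sits in the attention and MLP weight matrices rather than in the embeddings.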

Llama

Llama 7B, 13B, 33B, 65B (2023): The 65B model was pretrained on a dataset of 1.4T (trillion) tokens. LLaMA was officially stated to use Books3 (source) as a dataset - this is a very important dataset that has been pivotal in lawmaking regarding the training of AIs on large amounts of copyrighted and potentially pirated material.

Llama-3.1 405B (2024): The 405B Llama model was released. This is a dense transformer model, meaning all parameters are used in every inference pass. Initial pretraining: 2.87T tokens, long context: 800B, annealing: 40M - so roughly 3.67T tokens in total (the arithmetic is sketched below). Source: The Llama 3 Herd of Models. By this point Meta had learned to say less about what data goes into its models - "We create our dataset for language model pre-training from a variety of data sources containing knowledge" - so I can't say as much about what goes into the training data here.

From the same paper: "Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks."
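As a quick sanity check on those Llama 3.1 numbers, here is a minimal sketch that adds up the stated token budgets and, under my own assumption of 16-bit (bf16) weights, what the 405B parameters alone would occupy in memory; the bytes-per-parameter choice is mine, not a figure from the paper.

```python
# Sanity-check the Llama 3.1 405B token budget quoted above.
pretrain_tokens = 2.87e12     # initial pretraining
long_context_tokens = 800e9   # long-context stage
annealing_tokens = 40e6       # annealing stage

total = pretrain_tokens + long_context_tokens + annealing_tokens
print(f"total ~= {total / 1e12:.2f}T tokens")  # ~3.67T

# Assumption (mine, not from the paper): 2 bytes per parameter (bf16 weights).
params = 405e9
print(f"weights alone ~= {params * 2 / 1e9:.0f} GB in bf16")  # ~810 GB
```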
