
Vintage Large Language Models


This is the transcript of a short talk on vintage large language models.

A video of the talk is available here. A tweet thread is here and the slide deck is here.

Vintage Large Language Models: Transcript

I'm going to talk about vintage large language models. The idea is to introduce language models trained on past data. We are sending language models back into the past - into Roman times, the Tudor era, the 1960s, or more recent decades.

What is a Vintage LLM?

A vintage LLM is a large language model trained on texts and potentially images or other multimodal data up to a particular date. The date could be 2019, so it's trained only on data until 2019 - that's the easier case. It could be up until 1900 or even up until 200 AD, and these historical cases are much more challenging.

One challenge is having enough training data. Another is that the training data needs to be free of contamination. For a model trained up till 1900, there needs to be no information from after 1900 that leaks into the data. Some metadata might have that kind of leakage. While it's not possible to have zero leakage - there's a shadow of the future on past data because what we store is a function of what we care about - it's possible to have a very low level of leakage, sufficient for this to be interesting.
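To make the contamination constraint concrete, here is a minimal sketch of how a corpus might be filtered to a cutoff date. The record fields (`created`, `catalogued`) and the structure of the documents are purely illustrative assumptions, not a description of any actual vintage-LLM data pipeline.

```python
from datetime import date

CUTOFF = date(1900, 1, 1)

# Hypothetical document records; the field names are illustrative only.
corpus = [
    {"text": "...", "created": date(1897, 5, 3), "catalogued": date(1978, 2, 1)},
    {"text": "...", "created": date(1912, 9, 14), "catalogued": date(2005, 6, 30)},
]

def scrub(doc):
    """Keep only pre-cutoff documents and drop metadata added after the cutoff."""
    if doc["created"] >= CUTOFF:
        return None  # the document itself post-dates the cutoff
    # Fields like modern cataloguing dates are dropped rather than passed
    # through, since they encode knowledge of the future.
    return {"text": doc["text"], "created": doc["created"]}

training_docs = [d for d in (scrub(doc) for doc in corpus) if d is not None]
```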

You can include multimodal data like images. There's something strange about including images when going back to Roman times or 1700 because while they had texts, they didn't have digital images. However, this is acceptable for some purposes. You'd want to avoid leaking information that could only be known in the present. You could include things people at the time could see and experience themselves. For example, there may be no anatomically accurate painting in Roman times of a bee or an egg cracking, but you can include such images because people could see such things, even if they weren't part of their recorded media. You could also have pictures of buildings and artifacts that we still have from the past.

Scientific and Epistemic Motivations

One motivation comes from science and epistemics. These applications will become increasingly important. We want to test approaches to using LLMs for prediction and scientific invention. There is various work on using LLMs for forecasting, such as Halawi et al., who take a fine-tuned LLM and build agent-like scaffolding on top of it, combined with information retrieval and chain-of-thought prompting. Various reinforcement learning and other techniques could be added to optimize the LLM for forecasting.
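The general shape of such a retrieval-plus-chain-of-thought forecasting scaffold might look something like the sketch below. This is not Halawi et al.'s actual implementation; `retrieve` and `llm_complete` are stand-ins for whatever search index and language model a given scaffold is built around.

```python
def forecast(question: str, retrieve, llm_complete, deadline: str) -> float:
    """Sketch of a retrieval + chain-of-thought forecasting loop.

    `retrieve` and `llm_complete` are assumed helper functions, not real APIs.
    """
    # 1. Pull in relevant articles dated before the forecasting deadline,
    #    so the model only sees information it could have had at the time.
    articles = retrieve(question, before=deadline, k=5)
    context = "\n\n".join(a["summary"] for a in articles)

    # 2. Ask for step-by-step reasoning before the final probability.
    prompt = (
        f"Background articles (up to {deadline}):\n{context}\n\n"
        f"Question: {question}\n"
        "Reason step by step about the relevant considerations, "
        "then give a final probability between 0 and 1 on the last line."
    )
    reasoning = llm_complete(prompt)

    # 3. Parse the last line of the completion as the probability estimate.
    return float(reasoning.strip().splitlines()[-1])
```

A vintage LLM is a natural testbed for this kind of scaffold, since its forecasts can be scored against outcomes that lie after its training cutoff but are already known to us.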
