Think of a Number

My feed was recently clogged up with news articles reporting that Sam Altman thinks that AGI is here, or will be here next year, or whatever. I will refrain from giving even more air to this nonsense by linking to the stories. This kind of irresponsible hype-generation drives me nuts (although it also drives up stock prices so I can see why the tech bros are motivated to do it). Sure AI can have a good crack at undergraduate mathematics right now, and sure that’s pretty amazing. But our universities are full of students who can also have a good crack at undergraduate mathematics so in some sense we have achieved very little (and certainly nothing which is of any use to me as a working mathematician in 2025). If AI cannot do mathematics at PhD-student level (i.e., if it can’t start thinking for itself) then AGI is certainly not here, whatever “AGI” even means.

In an attempt to move beyond the hype and to focus on where we are right now, I’m going to try an experiment. It might fail. But if you’re a number theorist reading this, you can help me make it succeed.

Can AI do mathematics?

The null hypothesis as far as I am concerned is “not really”. More precisely, I have seen absolutely no evidence that language models are doing anything other than paraphrasing and attempting to generalise what they’ve seen on the internet, not always correctly. Don’t get me wrong — this gets you a long way! Indeed many undergraduate mathematics exams at university level are designed so that students who have a basic knowledge of the ideas in the course will pass the exam. The exam will test this knowledge by asking the student either to regurgitate the ideas, or to apply them in situations analogous to those seen in the problem sheets which came with the course. You can pass these exams by intelligent pattern-matching (if you have an encyclopedic knowledge of the material, which a machine would have). Furthermore, undergraduate pure mathematics is stagnant. The courses (other than my Lean course) which we teach in the pure mathematics degree at Imperial College London are essentially exactly the same as those which were offered to me when I was an undergraduate in Cambridge in the late 1980s, and very little post-1960 mathematics is taught at undergraduate level, even at the top institutions in the world, because it simply takes too long to get there. This is exactly why language models can pass undergraduate mathematics exams. But as I already said, this is no use to me as a working mathematician and it is certainly not AGI.

How do we test AI?

If you look at currently available mathematics databases, they are essentially all focused on mathematics which is at undergraduate or olympiad-style level. Perhaps the tech bros are deluded in thinking that because their models are getting high marks on such databases then they’re doing well. I want to firstly stress the fundamental notion that these databases are completely unrepresentative of mathematics. More recently we had the FrontierMath dataset, which was designed by mathematicians and which looked like a real step in the right direction, even though the answers to the questions were not proofs or ideas, but just numbers (which is also a long way from what research mathematics looks like); the dataset is private, but the 5 sample questions which were revealed to us were clearly beyond undergraduate level. People (including me) were tricked into saying publically that any machine which can do something nontrivial here would really be a breakthrough. The next thing we know, OpenAI were claiming that their system could get 25 percent on this database. Oof. But shortly after that, Epoch AI (who put he dataset together) revealed that 25 percent of the questions were undergraduate or olympiad level, and this week it transpires that OpenAI were behind the funding of the dataset and reportedly were even given access to some of the questions (added later: in fact Elliot Glazer from Epoch AI has confirmed this on Reddit); all of a sudden 25 percent doesn’t sound so great any more.

So maybe we need to do it all again

Let’s make a database

I want people (i.e. researchers in number theory at PhD level or beyond) to help me put together a secret database of hard number theory problems. I already have 5 but I need at least 20 and ideally far more, so I need help. The problems need to be beyond undergraduate level in the sense that undergraduates will not have been taught all the skills necessary to solve them. Because LLMs are not up to writing proofs, the answers unfortunately need to be non-negative integers: there is simply no point asking an LLM basic research level questions and then ploughing through the crap they produce and giving it 0/10; anyone who thinks that LLMs are actually capable of writing proofs should show me one that can do the Putnam because all efforts I saw on the 2024 exam were abysmal and this is undergraduate level: as always, the difficulty here is that soon after the questions and solutions are made public, the models train on them and the experiment is thus invalidated. In particular there is no point asking a model how to solve the 2024 Putnam questions now, one would expect perfect solutions, parroted off the internet by the parrots which we are being told are close to AGI. Note that restricting to questions for which the answer is a non-negative integer is again moving in a direction which is really really far from what researchers actually do, but unfortunately this is the level we are at right now.

Once we have a decent amount of questions, which perhaps means at least 20 and maybe means 50, I’ll announce this and then any AI company who wants a go can get in touch and I’ll send them the questions but with no solutions and they can send me back the answers and then I’ll announce the company and the score they got publically. Each company is allowed one go. When a few have had a go, I’ll make the questions public.

... continue reading