Oxford University is immersed in the past like no other place I’ve seen.
One example: when I was a visiting student at Oxford in 2005, I remember meeting two students at a pub one evening. They were drinking ivy-laced beer. The reason, I was told, is that centuries ago, a student from Lincoln College had murdered a student of Brasenose. Ever since then, Brasenose students had been allowed into Lincoln and given free beer once a year.
Here’s the event back in 1938:
The actual truth behind “ivy ale day” is unclear — accounts of it usually use phrases like “some half-remembered collegiate slight.” But I found this description from 1904, which in turn cites a reference dating back to 1604.
There are many more stories like this, all circling around Oxford’s unusual relationship to history and to time.
The ultimate example may be All Souls College, which has a ritual, the Mallard Song, that occurs once a century. Here’s how Wikipedia describes it:
In the ceremony, Fellows parade around the college with flaming torches, led by a "Lord Mallard" who is carried in a chair, in search of a giant mallard that supposedly flew out of the foundations of the college when it was being built in 1437. The procession is led by an individual carrying a duck — originally dead, now just wooden — tied to the end of a vertical pole. The ceremony was last held in 2001, with Martin Litchfield West acting as Lord Mallard.
All Souls is also the home of the famed All Souls Examinations, which have been called the most difficult exams in the world. That seems an impossible thing to measure. But I do think a strong case could be made that these exams are the world’s most eccentric. In the rest of this post, I’ll argue that the All Souls exams are newly relevant as we confront the fact that frontier AI models have totally destabilized how we do academic assessment, and indeed how we think about thinking and writing in general. Most tests and exams prioritize giving the median answer — the one that experts tend to converge on as “right.” These do something different, in a way that we can learn from.
Pedantry, Purity, Prejudice, Play
Back in 2005, I thought of the college’s legendarily strange assessment system as an antiquated, Victorian-era holdover. In 2025, I see it differently.
Unlike any other academic institution I’ve ever heard of, All Souls is not really open to students at all. It does not offer undergraduate classes, and it has no traditional admissions process, even for grad students. Instead, it offers seven-year-long Examination Fellowships.
A photo of All Souls I took in 2005.
Here are the selection criteria, from the All Souls website:
The College sets a written examination, consisting of four papers of three hours each. Two of these are in your chosen specialist subject… The other two papers are 'general', and contain questions on a wide range of subjects. In previous years, candidates sat a fifth paper, in which they were required to write an essay in response to a single word; this is no longer the case.
A few things stand out here. First, that is a lot of time to spend writing essays: twelve hours in all, sitting in a room, with pencil and paper and a short list of questions.
Second, notice the (now discontinued) single word essay. Yes, three hours answering a question that consisted of a single word.
I have written before about the historical research abilities of language models, which are getting quite proficient at tasks like translation, text transcription, image analysis, and making thematic and conceptual links between genres, time periods, and settings — all very useful tools for historians:
But despite this, I am also struck by the continued failure of frontier models when it comes to originality in writing. This can include creative writing, but the concept of originality in writing has a much wider scope than fiction. The All Souls College examinations very much fall under that heading. To be sure, they are designed to do multiple things at once, including testing raw knowledge and recall in a range of disciplines (the 2024 exams had specialist sections for Classical Studies, Economics, English Literature, History, Law, Philosophy, and Politics).
But it is quite clear that they also test the ability to write from an opinionated, individual point of view about these topics. This is especially true of the “General Paper” (the questions all applicants must answer). These questions seem purpose-built to defeat any answer that consists of rote memorization of facts.
Here are the first few. You can read the complete list here:
And here are some of the specialist questions in history for 2024:
My honest opinion of these questions is that they are uneven and in some cases performatively “clever.”
But I deeply admire the implicit ethos of education in them. Yes, you should have a storehouse of facts in your head, laboriously gained through study. But you should also be willing to write essays in the original, Michel de Montaigne sense of the term — personally grounded explorations of uncertain intellectual terrain.
The question about cultural historians and ice cores is one I love for this reason: there is no “right” answer to it, no rubric that could be applied. It’s a brilliant prompt for thinking deeply about the current state of historical research and how it relates to the environmental sciences and the study of climate change. It is also, implicitly, a question that invites disagreement — I wonder what percentage of respondents argued that cultural historians shouldn’t care, and how such responses were scored. There are a lot of meta-level considerations like this embedded within these prompts, the full list of which is available here.
And as for the famous single word essay prompts? Wikipedia has a complete list here. Here is the first entry:
1914: Culture
And the last five:
2005: Style
2006: Water
2007: Harmony
2008: Novelty
2009: Reproduction
The jagged frontier of machine reasoning
Ethan Mollick often writes about the concept of a “jagged frontier” of AI reasoning ability, and that is very much in evidence here. There are many questions on the All Souls lists which the leading frontier models can answer extremely ably.
For instance, GPT-5 (full answer here) and Claude Opus 4.1 (answer) can both do an excellent job on the one about Achaemenid Persia as a template for Sassanian power. Claude’s answer to this question is, in my opinion, astonishingly good, since it leverages the superhuman linguistic and geographic knowledge of LLMs to excellent effect.
But give these two leading AI models the task of writing a lengthy essay on the topic of “water,” and they immediately spin off into BS. The lower-powered LLMs, like GPT-5’s non-reasoning variant, will tend to provide encyclopedia entry-type answers to these prompts, mentioning a grab bag of facts about water, or culture, or the like. That’s such an obviously bad strategy that I find it hard to imagine any human choosing it (“I have three hours to write about the concept of novelty? Let me list some things that are new…”)
Claude Opus 4.1 tries to be more tactical, but it ends up sounding like a pompous windbag from the very first sentence:
Water—that most paradoxical of substances, simultaneously the most common and the most extraordinary, the most transparent and the most opaque to understanding—presents itself as perhaps the supreme test of intellectual synthesis. To examine water is to confront the fundamental tensions between materiality and metaphor, between scientific precision and humanistic interpretation, between the molecular and the civilizational.
GPT-5 Thinking is much the same, just with more purple prose:
To write “water” is to trace a braid of physics, ecology, technology, culture, and power. This essay follows that braid.
Yuck.
Not long after GPT-5’s release, Michael Nielsen asked:
Still, none of the AI models can write. Is this the grounding problem? The models' reality is the words they were trained on; good writers also train on lots of words, but are in addition wonderful observers of a much broader reality. The writers' world models seem much deeper as a result, and there can be a compelling unity in the writing. It's puzzling, because the models can be very helpful in critiquing and improving writing. And occasionally they produce beautiful phrases or sentences. But they still seem to me ungrounded in some important way
It’s a good question. These one word prompts are a perfect example of Nielsen’s point. Consider the following words, all prompts from past versions of the All Souls exam:
Pain, Sin, Space, Self-deception, Miracles, Bias, Style
What do they evoke for you? It is literally impossible for me to say, because your understanding of each of them is profoundly shaped by your life experience, your sensory perception, your unconscious, your childhood, and a million other things grounded in actually operating in a physical world.
By contrast, even though I found Claude Opus 4.1’s answer to the ancient Persia question shockingly good, I also noticed that it had a remarkably similar structure to GPT-5’s answer. Both began by pointing to the same inscription and making much the same argument about it. The same patterns emerge for all LLM answers to these questions. They converge on an optimal path through a thicket of concepts. They rarely, if ever, surprise.
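If you want to see this convergence for yourself, one rough way is to send the same one-word prompt to two different frontier models and read the openings side by side. Below is a minimal sketch of how that might look using the OpenAI and Anthropic Python SDKs; the model names and the prompt wording are placeholders of my own, not the exact setup behind the answers linked above, so adjust them to whatever is current.

```python
# A rough sketch for comparing two models on the same All Souls-style prompt.
# Assumes the `openai` and `anthropic` Python packages are installed and that
# API keys are set in OPENAI_API_KEY and ANTHROPIC_API_KEY. The model names
# below are placeholders; swap in whatever is currently available.
from openai import OpenAI
import anthropic

PROMPT = (
    "You have three hours and a pen. Write an essay in response to a "
    "single word: Water."
)

openai_client = OpenAI()
gpt_reply = openai_client.chat.completions.create(
    model="gpt-5",  # placeholder model name
    messages=[{"role": "user", "content": PROMPT}],
)

anthropic_client = anthropic.Anthropic()
claude_reply = anthropic_client.messages.create(
    model="claude-opus-4-1",  # placeholder model name
    max_tokens=4000,
    messages=[{"role": "user", "content": PROMPT}],
)

# Print only the opening of each essay: the first few hundred characters
# are where the convergence (and the purple prose) is easiest to spot.
print("GPT:\n", gpt_reply.choices[0].message.content[:600])
print("\nClaude:\n", claude_reply.content[0].text[:600])
```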
If exams like All Souls outline the negative space around LLM capabilities in humanistic reasoning — and I think they do — then that makes this format and type of reasoning more important than it was before. Being able to improvise brilliantly for three hours on the topic of Miracles or Self-Deception or Harmony might come off as unnecessary and even pompous. And to be blunt, it probably is. There’s a reason why All Souls dropped the one word questions.
But — to be able to think on your feet, with originality, creativity, and verve, in a way that is grounded in both a wide-ranging knowledge of facts and a thoughtful, probing, deeply individualistic sense of your subjective opinions, memories, sense experiences, and intuitions about a topic? That pretty much sums up what I think will be the lasting value, and the new goal, of humanistic education in the 2020s and 2030s.
Never again will humans be able to give a better answer than a machine on questions like the one about Achaemenid Persia. That sort of question has already had its AlphaGo moment.
But not all of them have, or will.
Weekly links
• “Construction, Hotel, Hotel, Hotel, Ceremonial paper goods, Restaurant, Wine, Mint, Religious Goods, Mint”: the occupations of the world’s oldest extant companies, founded between the sixth and ninth centuries CE. Discussed in “Slow,” an interesting list of extremely slow-burning human projects by the aforementioned Michael Nielsen.
• A firsthand account from 2022 of an Oxford student’s experience taking the All Souls Exam, posted on Reddit “in the public interest of making the whole thing less secretive and weird.”
• Sixteenth-century mariners had no real knowledge of Australia. Yet this did not stop the Spanish Empire from claiming possession of all territory from the Straits of Magellan to the South Pole, as the Governorate of Terra Australis. In 1539 this land was signed over to a single man, Pedro Sánchez de la Hoz. According to Wikipedia: “Given that Chile and Argentina have historically successfully established their border based on the uti possidetis iuris principle of international law, the Sánchez de la Hoz grant forms part of their arguments for territorial claims in Antarctica.”