
DeepSeek’s self-correcting AI model aces tough maths proofs


Credit: Nikolas Kokovlis/NurPhoto via Getty

Chinese artificial intelligence company DeepSeek has released a mathematical reasoning model that can identify and correct its own errors. The model beat the best human score in one of the world’s most prestigious undergraduate maths competitions.

The model, DeepSeekMath-V2, scored 118 out of 120 points on questions from the 2024 William Lowell Putnam Mathematical Competition, beating the top human score of 90. The model also performed at the level of gold medallists at the 2025 International Mathematical Olympiad (IMO) and the 2024 China Mathematical Olympiad. The results are described in a preprint¹ posted on arXiv on 27 November.

“We are at a point where AI is about as good at maths as a smart undergraduate student,” says Kevin Buzzard, a mathematician at Imperial College London. “It is very exciting.”

In February, AlphaGeometry 2, an AI problem-solver created by Google DeepMind in London, also achieved gold-medal-level performance at the IMO. The feat was repeated in July by Gemini Deep Think, another DeepMind model.

Reasoning over answers

Early approaches to training large language models for mathematical reasoning focused on the accuracy of final answers, the preprint authors write. But a correct answer does not guarantee correct reasoning: a model can reach the right value through a lucky guess, or through errors that happen to cancel out. ('Cancelling' the sixes in 16/64, for instance, yields the correct value 1/4 by entirely invalid reasoning.) Moreover, an exclusive focus on the end result is of little use for proving mathematical theorems, where the chain of logical reasoning matters more than the final answer.

Tong Xie, a chemist specialising in AI-driven discoveries at UNSW Sydney in Australia, says the researchers behind DeepSeek, as well as those developing Gemini Deep Think, have been working to overcome this problem by rewarding sound reasoning rather than just the final answer.
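As a rough illustration of the shift both teams are pursuing, the two reward schemes can be contrasted in a few lines of Python. Both functions below are hypothetical sketches, not code from DeepSeek or DeepMind.

```python
# Hypothetical contrast between the two reward schemes discussed above.

def answer_only_reward(final_answer: str, reference: str) -> float:
    # Early approach: full reward for a matching final answer, zero
    # otherwise, regardless of whether the reasoning was sound.
    return 1.0 if final_answer == reference else 0.0

def reasoning_reward(step_scores: list[float]) -> float:
    # Process-focused approach: reward the soundness of every deduction,
    # here averaged over per-step rigour scores assigned by a verifier.
    return sum(step_scores) / len(step_scores) if step_scores else 0.0
```

Under the first scheme, a proof with flawed logic but a lucky final answer earns full reward; under the second, it does not.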

DeepSeekMath-V2 introduces self-verifiable mathematical reasoning for the first time. The model pairs a proof generator, which constructs solutions, with a verifier trained to evaluate mathematical proofs (chains of step-by-step deductions), identify logical flaws and assign a score for how rigorous each proof is. A meta-verification system then checks whether the verifier's critiques are themselves accurate, reducing the likelihood of hallucinations and improving trustworthiness. Guided by the verified critiques, the generator evaluates its own work and refines its arguments until no further issues can be found.
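As a minimal sketch of how the loop described above could fit together, the following Python uses toy stand-ins for the three components. Every name here (Critique, generate_proof, verify, meta_verify, refine, solve) is a hypothetical illustration rather than DeepSeek's actual interface; in the real system each stand-in would be a call to a large model.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    step: int      # index of the proof step being challenged
    reason: str    # why the verifier thinks that step is flawed

def generate_proof(problem: str) -> list[str]:
    # Stand-in for the proof generator: drafts a step-by-step argument.
    return [f"Consider the statement: {problem}", "Hence the claim holds"]

def verify(proof: list[str]) -> list[Critique]:
    # Stand-in for the verifier: flag steps that carry no justification.
    return [Critique(i, "step lacks justification")
            for i, step in enumerate(proof) if "because" not in step]

def meta_verify(critique: Critique, proof: list[str]) -> bool:
    # Stand-in for the meta-verifier: keep only critiques that hold up.
    return 0 <= critique.step < len(proof)

def refine(proof: list[str], critiques: list[Critique]) -> list[str]:
    # Stand-in for the generator revising the steps that were flagged.
    fixed = list(proof)
    for c in critiques:
        fixed[c.step] += " because <added justification>"
    return fixed

def solve(problem: str, max_rounds: int = 8) -> list[str]:
    proof = generate_proof(problem)
    for _ in range(max_rounds):
        critiques = [c for c in verify(proof) if meta_verify(c, proof)]
        if not critiques:             # no verified flaws remain: accept
            return proof
        proof = refine(proof, critiques)
    return proof                      # best effort after max_rounds

print(solve("the sum of two odd numbers is even"))
```

The termination rule mirrors the behaviour described in the article: a proof is accepted only once no critique survives meta-verification.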

The design creates a feedback loop: the verifier improves the generator, and as the generator produces more-challenging proofs, these become new training data to strengthen the verifier.
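A hedged sketch of that co-training loop, assuming toy Verifier and Generator classes: the class names, the rigour score and the fit methods below are illustrative inventions, not DeepSeek's training code.

```python
import random

class Verifier:
    def __init__(self):
        self.examples = []
    def score(self, proof: list[str]) -> float:
        # Toy rigour score: the fraction of steps carrying a justification.
        return sum("because" in step for step in proof) / len(proof)
    def fit(self, batch):
        # In the real system: gradient updates on freshly scored proofs.
        self.examples.extend(batch)

class Generator:
    def prove(self, problem: str) -> list[str]:
        # Toy proof of a given problem: each step justified at random.
        return [f"step {i} of {problem}"
                + (" because <reason>" if random.random() < 0.7 else "")
                for i in range(5)]
    def fit_against(self, verifier: Verifier):
        # In the real system: reinforcement learning with the verifier's
        # rigour score as the reward signal.
        pass

def training_round(problems: list[str], gen: Generator, ver: Verifier):
    batch = [(proof, ver.score(proof)) for proof in map(gen.prove, problems)]
    ver.fit(batch)           # harder generated proofs refresh the verifier
    gen.fit_against(ver)     # the stronger verifier then trains the generator
```

Each round feeds the generator's newest, hardest proofs back to the verifier as training data, which is the feedback loop the paragraph above describes.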
