Tech News

A new Google model is nearly perfect on automated handwriting recognition


Google has a webapp called AI Studio where people can experiment with prompts and models. In the last week, users have found that every once in a while they will get two results and are asked to select the better one. The big AI labs typically do this type of A/B testing on new models just before they’re released, so speculation is rampant that this might be Gemini-3. Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

Curious, I tried it out on transcribing some handwritten texts and the results were shocking: not only was the transcription very nearly perfect—at expert human levels—but it did something else unexpected that can only be described as genuine, human-like, expert-level reasoning. It is the most amazing thing I have seen an LLM do, and it was unprompted, entirely accidental.

What follows are my first impressions of this new model, with all the requisite caveats that entails. But if my observations hold true, this will be a big deal when it’s released. We appear to be on the cusp of an era when AI models will not only start to read difficult handwritten historical documents just as well as expert humans but also analyze them in deep and nuanced ways. While this is important for historians, we need to extrapolate from this small example to think more broadly: if this holds, the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together. As is so often the case with AI, that is exciting and frightening all at once. Even a few months ago, I thought this level of capability was still years away.

A New Model

Rumours started appearing on X a week ago that there was a new Gemini model in A/B testing in AI Studio. It’s always hard to know what these mean, but I wanted to see how well this thing would do on handwritten historical documents, because that has become my own personal benchmark. I am interested in LLM performance on handwriting for a couple of reasons. First, I am a historian, so I intuitively see why fast, cheap, and accurate transcription would be useful to me in my day-to-day work. But in trying to achieve that, and in learning about AI, I have come to believe that recognizing historical handwriting poses something of a unique challenge and makes a great overall test of LLM abilities in general. I also think it shines a small amount of light on the larger question of whether LLMs will ultimately prove capable of expert human levels of reasoning or prove to be a dead end. Let me explain.

Most people think that deciphering historical handwriting is a task that mainly requires vision. I agree that this is true, but only to a point. When you step back in time, you enter a different country, or so the saying goes. People talk differently, using unfamiliar words or familiar words in unfamiliar ways. People in the past used different systems of measurement and accounting, different turns of phrase, punctuation, capitalization, and spelling. Implied meanings were different as were assumptions about what readers would know.

While it can be easy to decipher most of the words in a historical text, without contextual knowledge about the topic and time period it’s nearly impossible to understand a document well enough to accurately transcribe the whole thing—let alone to use it effectively. The irony is that some of the most crucial information in historical letters is also the most period-specific and thus the hardest to decipher.

Even beyond context awareness, though, paleography involves linking vision with reasoning to make logical inferences: we use known words, and thus known letters, to identify uncertain letters. As we shall see, documents very often become logic puzzles, and LLMs have mixed performance on logic puzzles, especially novel formulations they have not been trained on. For this reason, it has been my intuition for some time that models would either solve the problem of historical handwriting and other similar problems as they increased in scale, or they would plateau at high but imperfect levels of accuracy, below those of expert humans.

Prediction Can Only Get You So Far…

I don’t want to get too technical here, but it is important to understand why these types of things are so hard for LLMs and why the results I am reporting here are significant. Since the first vision model, GPT-4, was released in March 2023, we’ve seen handwritten text recognition (HTR) scores steadily improve to the point that models get about 90% (or more) of a given text correct. Much of this can be chalked up to technical improvements in image processing and better training data, but it’s that last 10% that I’ve been talking about above.
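For readers wondering what a figure like "90% correct" means in practice: HTR accuracy is commonly reported as one minus the character error rate (CER), where CER is the edit distance between the model's transcription and a ground-truth transcription, divided by the length of the ground truth. The sketch below is illustrative only—it is a standard way to compute such a score, not necessarily the exact metric behind any particular benchmark.

```python
# Minimal sketch of character-level transcription accuracy (1 - CER).
# Illustrative only; real HTR benchmarks may normalize whitespace,
# punctuation, or casing before scoring.

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(hypothesis: str, reference: str) -> float:
    """1 - CER, floored at zero for badly wrong hypotheses."""
    cer = levenshtein(hypothesis, reference) / max(len(reference), 1)
    return max(0.0, 1.0 - cer)

# One missed letter in a 19-character line already costs ~5% accuracy,
# which is why the gap between ~90% and expert-human transcription
# represents so many real errors per page.
print(round(char_accuracy("the quick brwn fox", "the quick brown fox"), 3))
```

Note how unforgiving the metric is at scale: at 90% character accuracy, a 2,000-character letter still contains roughly 200 character errors, many of them clustered in exactly the unfamiliar names, dates, and period-specific terms a historian needs most.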
