
LLMs are getting better at character-level text manipulation


Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle character-level tasks: counting characters, manipulating characters in a sentence, and solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.

Character manipulation

LLMs handle individual characters poorly. This is because all text is encoded as tokens via the LLM tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other languages common in the training dataset). This makes any reasoning at a more granular level than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out the individual characters in a word) for a while.
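To make this concrete, here is a minimal sketch using OpenAI's tiktoken library (assuming it is installed; the exact token boundaries depend on the chosen vocabulary):

```python
import tiktoken  # pip install tiktoken

# Load a BPE vocabulary; cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("strawberry")
print(tokens)  # a handful of token IDs, not one ID per character

# Show the character cluster each token covers.
for t in tokens:
    print(enc.decode_single_token_bytes(t))
# The word arrives as a few multi-character chunks, so the model never
# directly "sees" the individual letters it is asked to count or replace.
```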

To demonstrate just how poorly earlier generations handled basic character manipulation, here are the responses from several OpenAI models to the prompt Replace all letters "r" in the sentence "I really love a ripe strawberry" with the letter "l", and then convert all letters "l" to "r":

| Model | Response |
| --- | --- |
| gpt-3.5-turbo | I lealll love a liple strallbeelly |
| gpt-4-turbo | I rearry rove a ripe strawberly |
| gpt-4o | I rearry rove a ripe strawberrry |
| gpt-4.1 | I rearry rove a ripe strawberry |
| gpt-5-nano | I really love a ripe strawberry |
| gpt-5-mini | I rearry rove a ripe strawberry |
| gpt-5 | I rearry rove a ripe strawberry |
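For reference, the expected output can be computed mechanically; a two-step string replacement mirrors the prompt exactly:

```python
s = "I really love a ripe strawberry"

# Step 1: replace every "r" with "l".
step1 = s.replace("r", "l")      # "I leally love a lipe stlawbelly"

# Step 2: convert every "l" to "r" -- including the original l's
# and the ones introduced in step 1.
step2 = step1.replace("l", "r")
print(step2)                     # "I rearry rove a ripe strawberry"
```

This matches what gpt-4.1 and the larger GPT-5 models produce; gpt-5-nano instead returned the original sentence unchanged.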

Note that I disabled reasoning for the GPT-5 models to make the comparison fairer. Reasoning helps tremendously with tasks like these (and some of the models use chain of thought directly in the output in the absence of reasoning), but I am interested in the generational uplift we observe just from raw model improvements. GPT-5 Nano is the only new-generation model that makes a mistake, but given its size, that is perhaps not so surprising. Other than that, we can see that starting with GPT-4.1, models could consistently complete this task without any issues. If you’re curious about the Anthropic models, Claude Sonnet 4 is the first one to crack it. Interestingly, it was released at approximately the same time as GPT-4.1.

Counting characters

Next, let’s take a look at counting characters. LLMs are notoriously bad at counting, so unsurprisingly, only one model could reliably count the characters in the following sentence: “I wish I could come up with a better example sentence.” That model was GPT-4.1; the others sometimes correctly counted the number of characters in each individual word, but then fumbled when adding the numbers up. However, with reasoning set to low, GPT-5 at all sizes (including Nano) completes the task correctly. Similarly, the Claude Sonnet models complete the task without problems when they are allowed to reason.
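The ground truth is easy to verify outside the model. This sketch computes both plausible readings of the task (total characters versus letters only), and mirrors the word-by-word strategy the models tend to use:

```python
s = "I wish I could come up with a better example sentence."

# Total characters, including spaces and the final period.
print(len(s))  # 54

# Count each word separately, then add the counts up --
# the summation step is where older models typically slipped.
words = [w.strip(".") for w in s.split()]
counts = [len(w) for w in words]
print(counts)       # [1, 4, 1, 5, 4, 2, 4, 1, 6, 7, 8]
print(sum(counts))  # 43 letters
```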

We see a similar story when we ask the models to count specific characters. Counting the r’s in the r-ified strawberry sentence is correct most of the time for GPT-5 at all sizes, again including Nano and even without reasoning. However, it is less consistent, and when you throw in another curveball (such as changing strawberry to strawberrry), the results are mixed. This time, though, the problem is not arithmetic (adding the individual counts up) but identifying the r’s within a word itself.
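Again, the reference answers are one-liners; assuming the r-ified sentence is the output of the replacement task above:

```python
# The r-ified sentence produced by the replacement task.
s = "I rearry rove a ripe strawberry"
print(s.count("r"))  # 8

# The curveball variant with an extra r in "strawberrry".
print(s.replace("strawberry", "strawberrry").count("r"))  # 9
```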

Base64 and ROT13
