
Can modern LLMs count the number of b's in "blueberry"?


Last week, OpenAI announced and released GPT-5, and the consensus both inside and outside the AI community is that the new LLM did not live up to the hype. Bluesky — whose community is skeptical at best of generative AI in all its forms — began putting the model through its paces: Michael Paulauski asked GPT-5 through the ChatGPT app interface “how many b’s are there in blueberry?”. It’s a simple question that a human child could answer correctly, but ChatGPT stated that there are three b’s in blueberry when there are clearly only two. Another attempt by Kieran Healy went more viral, as ChatGPT insisted blueberry has 3 b’s despite the user repeatedly arguing to the contrary.

Other Bluesky users were able to replicate this behavior, although results were inconsistent: GPT-5 uses a new model router that quietly determines whether the question should be answered by a better reasoning model or whether a smaller model will suffice. Additionally, Sam Altman, the CEO of OpenAI, later tweeted that this router was broken during these tests and therefore “GPT-5 seemed way dumber,” which could confound test results.

About a year ago, one meme in the AI community was to ask LLMs the simple question “how many r’s are in the word strawberry?”, as major LLMs consistently and bizarrely failed to answer it correctly. It’s an intentionally adversarial question for LLMs because LLMs do not take letters directly as input; instead, the text is tokenized. To quote TechCrunch’s explanation:

This is because the transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which is then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters — let alone how many “r”s — appear in the word “strawberry.”
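To make the tokenization point concrete (my illustration, not something from the article), here is a minimal sketch using OpenAI’s tiktoken library; the choice of the o200k_base encoding is an assumption about which tokenizer a given model uses:

```python
import tiktoken

# Load the o200k_base encoding, used by recent OpenAI models (an assumption
# for illustration; other models use other tokenizers).
enc = tiktoken.get_encoding("o200k_base")

token_ids = enc.encode("blueberry")
token_strings = [enc.decode([token_id]) for token_id in token_ids]

# The model "sees" a few multi-character chunks (e.g. something like
# "blue" + "berry"), not ten individual letters, so counting b's is not
# a direct lookup for it.
print(token_ids)
print(token_strings)
```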

It’s likely that OpenAI/Anthropic/Google have included this specific challenge in their LLM training datasets to preemptively address the fact that someone will try it, making the question ineffective for testing LLM capabilities. Asking how many b’s are in blueberry is a semantically similar question, but it may be just sufficiently out of domain to trip the LLMs up.

When Healy’s Bluesky post became popular on Hacker News, a surprising number of commenters cited the tokenization issue and discounted GPT-5’s responses entirely because (paraphrasing) “LLMs fundamentally can’t do this”. I disagree with their conclusions in this case, as tokenization is a less effective counterargument here: if the question had only been asked once, maybe, but Healy asked GPT-5 several times, with different formattings of blueberry — and therefore different tokens, including single-character tokens — and it still asserted that there are 3 b’s every time. Tokenization making it difficult for LLMs to count letters makes sense intuitively, but time and time again we’ve seen LLMs do things that aren’t intuitive. Additionally, it’s been a year since the strawberry test, and hundreds of millions of dollars have been invested into improving RLHF regimens and creating more annotated training data: it’s hard for me to believe that modern LLMs have made zero progress on these types of trivial tasks.

There’s an easy way to test this behavior instead of waxing philosophical: why not just ask a wide variety of LLMs and see how often they can correctly identify that there are 2 b’s in the word “blueberry”? If LLMs are indeed fundamentally incapable of counting the number of specific letters in a word, that flaw should apply to all LLMs, not just GPT-5.

2 b’s, or not 2 b’s

First, I chose a selection of popular LLMs: from OpenAI, I of course chose GPT-5 (specifically, the GPT-5 Chat, GPT-5 Mini, and GPT-5 Nano variants) in addition to OpenAI’s new open-source models gpt-oss-120b and gpt-oss-20b; from Anthropic, the new Claude Opus 4.1 and Claude Sonnet 4; from Google, Gemini 2.5 Pro and Gemini 2.5 Flash; lastly, as a wild card, Kimi K2 from Moonshot AI. These contain a mix of reasoning-by-default and non-reasoning models, which will be organized separately since reasoning models should theoretically perform better; however, GPT-5-based models can route between reasoning and not reasoning, so the instances where those models do reason will also be classified separately. Using OpenRouter, which allows using the same API to generate from multiple models, I wrote a Python script to simultaneously generate a response to the given question from every specified LLM n times and save the LLM responses for further analysis. (Jupyter Notebook)
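The full code lives in the linked notebook; as a rough sketch of the same idea (the model IDs, trial count, and helper names below are my assumptions, not the article’s code), querying every model concurrently through OpenRouter’s OpenAI-compatible API might look like this:

```python
import os
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

QUESTION = "How many b's are there in blueberry?"

# Illustrative model IDs; the article's actual list is defined in its notebook.
MODELS = [
    "openai/gpt-5-mini",
    "anthropic/claude-sonnet-4",
    "google/gemini-2.5-flash",
    "moonshotai/kimi-k2",
]


def ask(model: str) -> dict:
    """Send the question to one model with default generation parameters."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return {"model": model, "response": response.choices[0].message.content}


def run_trials(n: int = 10) -> list[dict]:
    """Query every model n times concurrently and collect the raw responses."""
    jobs = [model for model in MODELS for _ in range(n)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(ask, jobs))


if __name__ == "__main__":
    results = run_trials(n=3)
    for r in results:
        print(r["model"], "->", r["response"][:80])
```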

In order to ensure the results are most representative of what a normal user would encounter when querying these LLMs, I will not add any generation parameters besides the original question: no prompt engineering and no temperature adjustments. As a result, I will use an independent secondary LLM with prompt engineering to parse out the predicted letter counts from the LLM’s response: this is a situation where normal parsing techniques such as regular expressions won’t work due to ambiguous number usage, and there are many possible ways to express numerals that are missable edge cases, such as The letter **b** appears **once** in the word “blueberry.”
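As a sketch of what that secondary extraction step could look like (the parser model, prompt wording, and function name here are my assumptions, not the article’s notebook):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

PARSER_PROMPT = (
    "You will be given an LLM's answer to the question "
    "\"How many b's are there in blueberry?\". "
    "Reply with only the integer count that the answer claims, e.g. \"2\", "
    "or \"unknown\" if no count is stated."
)


def extract_count(llm_response: str, parser_model: str = "openai/gpt-4.1-mini") -> int | None:
    """Ask a secondary LLM to pull the claimed letter count out of free-form text."""
    result = client.chat.completions.create(
        model=parser_model,
        messages=[
            {"role": "system", "content": PARSER_PROMPT},
            {"role": "user", "content": llm_response},
        ],
        temperature=0.0,  # deterministic extraction
    )
    text = result.choices[0].message.content.strip()
    return int(text) if text.isdigit() else None


# e.g. extract_count('The letter **b** appears **once** in the word "blueberry."') -> 1
```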
