Better Models: Worse Tools

A very strange Pi issue sent me down a rabbit hole over the last two days. The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested edits[] array. And this is not Haiku or some other small model: it is Opus 4.8. The edit itself is usually correct, but because the model adds made-up keys, the arguments do not match the schema, so Pi rejects the tool call and asks the model to try again.

That alone is not too surprising, as models sometimes emit malformed tool calls, particularly small ones. What surprised me is that this is getting worse with newer Anthropic models: both Opus 4.8 and Sonnet 5 show it, but none of the older models do. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.

In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.

Tool Calls Are Text

If you have not spent too much time looking at LLM tool calling internals, the important thing to understand is that tool calls are not magic and use some rather crude in-band signalling. The model receives a transcript, a system prompt and a list of available tools. The server munches that into a large prompt with special marker tokens. Because the model was trained and reinforced on examples of that format, at some point during generation it emits something that the API or client interprets as “call this tool with these arguments”.
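To make the in-band signalling idea concrete, here is a minimal Python sketch of how a client could pull tool calls back out of generated text. The marker strings, the `extract_tool_calls` helper, and the `name`/`arguments` payload shape are all made up for illustration. They are not Anthropic's actual format, which is not public.

```python
import json
import re

# Hypothetical markers, invented for this sketch. Real models use
# special tokens whose exact form is not publicly documented.
CALL_OPEN = "<<tool_call>>"
CALL_CLOSE = "<</tool_call>>"


def extract_tool_calls(generated: str) -> list[dict]:
    """Find marker-delimited JSON payloads in raw model output."""
    pattern = re.escape(CALL_OPEN) + r"(.*?)" + re.escape(CALL_CLOSE)
    calls = []
    for match in re.finditer(pattern, generated, flags=re.DOTALL):
        # Everything between the markers is treated as the call payload.
        calls.append(json.loads(match.group(1)))
    return calls


output = (
    "I will fix the typo now.\n"
    '<<tool_call>>{"name": "edit", "arguments": {"path": "some/file.py"}}<</tool_call>>'
)
calls = extract_tool_calls(output)
```

The point of the sketch is only that the "call" is ordinary generated text until something downstream decides to interpret it.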

For a file edit tool, the intended invocation payload might say something like this:

{"path": "some/file.py", "edits": [{"oldText": "text to replace", "newText": "replacement text"}]}

A harness then validates the arguments, performs the edit, and feeds the result back into the model. If validation fails, the model sees an error and usually tries again.
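The validate, edit, and report loop can be sketched in a few lines of Python. This is not Pi's implementation: the function names, the error wording, the in-memory `files` dict standing in for the file system, and the rule that `oldText` must match exactly once are all assumptions made for this sketch. The one thing taken from the behaviour described above is that unknown keys inside `edits[]` cause a rejection.

```python
TOP_KEYS = {"path", "edits"}
EDIT_KEYS = {"oldText", "newText"}


def validate_edit_call(args: dict) -> list[str]:
    """Return a list of schema problems; an empty list means valid."""
    problems = []
    extra = set(args) - TOP_KEYS
    if extra:
        problems.append(f"unexpected top-level keys: {sorted(extra)}")
    if not isinstance(args.get("path"), str):
        problems.append("path must be a string")
    edits = args.get("edits")
    if not isinstance(edits, list) or not edits:
        problems.append("edits must be a non-empty array")
        return problems
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            problems.append(f"edits[{i}] must be an object")
            continue
        missing = EDIT_KEYS - set(edit)
        unknown = set(edit) - EDIT_KEYS
        if missing:
            problems.append(f"edits[{i}] is missing keys: {sorted(missing)}")
        if unknown:
            # This is the failure mode discussed in this post.
            problems.append(f"edits[{i}] has unexpected keys: {sorted(unknown)}")
        for key in EDIT_KEYS & set(edit):
            if not isinstance(edit[key], str):
                problems.append(f"edits[{i}].{key} must be a string")
    return problems


def apply_edits(text: str, edits: list[dict]) -> str:
    """Apply each replacement in order; oldText must match exactly once."""
    for edit in edits:
        if text.count(edit["oldText"]) != 1:
            raise ValueError("oldText must match exactly once")
        text = text.replace(edit["oldText"], edit["newText"])
    return text


def handle_edit_call(args: dict, files: dict[str, str]) -> str:
    """Validate, edit, and return the message that is fed back to the model."""
    problems = validate_edit_call(args)
    if problems:
        return "Validation failed: " + "; ".join(problems)
    path = args["path"]
    if path not in files:
        return f"Error: no such file: {path}"
    try:
        files[path] = apply_edits(files[path], args["edits"])
    except ValueError as exc:
        return f"Error: {exc}"
    return f"Applied {len(args['edits'])} edit(s) to {path}"
```

With a strict validator like this, a call whose `edits[0]` carries an extra key such as `"reason"` never reaches `apply_edits`, even when `oldText` and `newText` are perfectly fine. The model only sees the validation message and has to retry.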

How exactly that formatting works is not publicly known for the Anthropic models, but some people have managed to extract “ANTML” markers, and these occasionally leak into public communications too. To the best of my knowledge, the call above would come out of the model serialized like this:
