Evals are not all you need
Published on: 2025-07-06 10:09:33
Andrew Marble
marble.onl
March 3, 2025
TLDR: Evals make sense for unitless comparison between different base language models (LLMs), and have their place in testing, but the premise of using them to guarantee software performance is flawed.
What are evals?
Evals (evaluations) refer to test-based performance measurement of AI systems. For example, for a customer service chatbot, an eval could entail prompting the chatbot with a set of customer queries, scoring the results (based on helpfulness, accuracy, etc.) and aggregating the scores to give a picture of how well the chatbot performs. Here I focus on LLMs unless otherwise mentioned.
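To make the prompt, score, aggregate loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_chatbot` is a hypothetical stand-in for the system under test, `keyword_score` is a deliberately crude scorer (real evals often use human ratings or a model-based grader), and the test cases are invented.

```python
from statistics import mean
from typing import Callable

def toy_chatbot(query: str) -> str:
    """Hypothetical stand-in for the chatbot being evaluated."""
    canned = {
        "How do I reset my password?": "Click 'Forgot password' on the login page.",
        "What is your refund policy?": "Refunds are available within 30 days.",
    }
    return canned.get(query, "Sorry, I don't know.")

def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Crude score: fraction of expected keywords found in the response."""
    if not expected_keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

def run_eval(
    chatbot: Callable[[str], str],
    cases: list[tuple[str, list[str]]],
) -> tuple[list[float], float]:
    """Prompt the chatbot with each query, score each reply, aggregate by mean."""
    scores = [keyword_score(chatbot(query), kws) for query, kws in cases]
    return scores, mean(scores)

# Invented test set: (customer query, keywords a good answer should mention).
CASES = [
    ("How do I reset my password?", ["forgot password", "login"]),
    ("What is your refund policy?", ["30 days", "refund"]),
    ("Can I change my shipping address?", ["address", "account"]),
]

per_case_scores, overall_score = run_eval(toy_chatbot, CASES)
print(per_case_scores, round(overall_score, 2))
```

The single aggregate number is what usually gets reported, and it depends entirely on which queries were chosen and how the scorer was defined.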
There’s been a proliferation of evaluation tools – both standalone and incorporated into observability platforms or app-building tools. My impression is that the dominant mode of building with AI is still ad hoc development without systematic testing (“prompt and pray”); however, the conventional wisdom
...