
How to Evaluate LLMs and GenAI Workflows Holistically


Background

LLMs and GenAI now assist professionals in a growing number of workflows across many settings: large companies, financial institutions, academic research, and even high-stakes industries such as healthcare and law. The outputs these models produce within those workflows influence decisions with real consequences. Yet standards for evaluating the performance of these AI systems and workflows have not kept pace with the speed of real-world deployment.

Why Evaluation Is Important

Generative LLM tasks typically don’t have a single, absolute ‘correct’ answer. Standard metrics such as accuracy, F1 score (the harmonic mean of precision and recall), and BLEU (a measure of n-gram overlap with reference text) have emerged to evaluate LLM outputs, but their usefulness is limited. F1 scores, for instance, help evaluate how well an LLM performs on classification tasks such as spam detection or sentiment analysis, but they say nothing about the system’s quality of reasoning, contextual relevance, clarity of writing, or instruction following. High scores on these standard metrics can create a false sense of security, which is risky for critical business workflows, where a more nuanced and thorough approach is needed to judge whether an LLM or GenAI workflow is actually performing well.
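As a quick illustration of how little these metrics capture on their own, the sketch below computes precision, recall, and F1 for a toy spam-classification eval; the labels and predictions are invented for the example. A model could score well here while still producing unclear or poorly reasoned outputs.

```python
# Toy example: F1 score for a spam-classification eval.
# Labels and predictions are made up purely for illustration.

labels      = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = spam, 0 = not spam
predictions = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall    = tp / (tp + fn) if (tp + fn) else 0.0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# The F1 number says nothing about reasoning quality, clarity, or instruction following.
```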

Introducing Evals

Evals (short for evaluations) help you assess how well an LLM or AI system’s output aligns with what the task or the user actually needs. You need them to measure how effectively your LLM performs within the workflow it is embedded in and the task it is meant to carry out. Evals go beyond correctness or factual accuracy to other dimensions of LLM/GenAI workflow performance such as usefulness, clarity, instruction following, reliability, actionability, and ethical alignment (for a deeper dive, see AI ethics and LLMs).
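One common way to operationalize these dimensions (the article doesn’t prescribe a specific method) is a rubric scored per dimension by a human reviewer or an LLM-as-judge. The sketch below is a minimal, assumption-laden version: the dimension names come from the paragraph above, while the weights and the `judge` callable are hypothetical placeholders you would back with real review logic.

```python
from dataclasses import dataclass
from typing import Callable

# Dimensions named in the article; the weights are illustrative assumptions.
RUBRIC = {
    "usefulness": 0.25,
    "clarity": 0.15,
    "instruction_following": 0.25,
    "reliability": 0.15,
    "actionability": 0.10,
    "ethical_alignment": 0.10,
}

@dataclass
class EvalResult:
    scores: dict[str, float]   # per-dimension scores in [0, 1]
    weighted_total: float

def run_eval(task: str, output: str, judge: Callable[[str, str, str], float]) -> EvalResult:
    """Score one output along every rubric dimension.

    `judge(task, output, dimension)` is a hypothetical hook: it could wrap a
    human reviewer's rating or an LLM-as-judge call returning a 0-1 score.
    """
    scores = {dim: judge(task, output, dim) for dim in RUBRIC}
    total = sum(RUBRIC[dim] * score for dim, score in scores.items())
    return EvalResult(scores=scores, weighted_total=total)

if __name__ == "__main__":
    # Stub judge for demonstration only: rewards longer outputs.
    stub_judge = lambda task, output, dim: 1.0 if len(output) > 20 else 0.5
    result = run_eval(
        "Summarize the contract's termination clause.",
        "The contract may be terminated with 30 days' written notice.",
        stub_judge,
    )
    print(result.weighted_total, result.scores)
```

In practice the stub judge would be replaced by a domain-specific rubric prompt or human review rather than a length check; the point is that each dimension gets its own explicit score rather than a single aggregate metric.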

Because different domains prioritize different kinds of ‘quality’, evals act as a subjective anchor, helping answer the question ‘Would a domain expert trust and use this output?’. They also give businesses a way to capture the complexity and nuance of the specific processes in which these LLM and GenAI-powered workflows are embedded, and to measure whether those workflows are doing what they’re supposed to.

Given the high stakes of GenAI-powered workflows and the money being invested to set them up, it falls to businesses, and the right roles within them, to build robust evaluations that yield meaningful signals about the performance of their many agents and AI systems at scale.

Where to Start with Evals

The natural starting point for evals is to think about the dimensions relevant to evaluating the LLM’s performance within the scenario or business process it is plugged into, as sketched below.
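As a hypothetical illustration of that first step (the workflows and dimension choices below are examples, not taken from the article), this mapping can be written down before any scoring code exists:

```python
# Hypothetical first pass: list the dimensions that matter for each workflow
# before writing any scoring logic. Workflows and dimensions are examples only.
EVAL_DIMENSIONS_BY_WORKFLOW = {
    "customer_support_reply_drafting": [
        "factual_grounding_in_ticket_history",
        "tone_matches_brand_guidelines",
        "actionability_of_next_steps",
    ],
    "contract_clause_summarization": [
        "faithfulness_to_source_clause",
        "coverage_of_obligations_and_deadlines",
        "clarity_for_non_lawyers",
    ],
    "financial_report_qa": [
        "numerical_accuracy",
        "citation_of_source_tables",
        "refusal_when_data_is_missing",
    ],
}
```

Each dimension then becomes a candidate for a rubric item, an automated check, or a judge prompt like the one sketched earlier.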
