Tech News

Open-source MCPEval makes protocol-level agent testing plug-and-play



Enterprises are beginning to adopt the Model Context Protocol (MCP) primarily to help agents identify and use the right tools. Researchers at Salesforce, however, have found another use for MCP: evaluating the AI agents themselves.

The researchers unveiled MCPEval, a new method and open-source toolkit built on MCP's architecture to test how well agents perform when using tools. Current evaluation methods for agents, they noted, are limited because they “often relied on static, pre-defined tasks, thus failing to capture the interactive real-world agentic workflows.”

“MCPEval goes beyond traditional success/failure metrics by systematically collecting detailed task trajectories and protocol interaction data, creating unprecedented visibility into agent behavior and generating valuable datasets for iterative improvement,” the researchers said in the paper. “Additionally, because both task creation and verification are fully automated, the resulting high-quality trajectories can be immediately leveraged for rapid fine-tuning and continual improvement of agent models. The comprehensive evaluation reports generated by MCPEval also provide actionable insights towards the correctness of agent-platform communication at a granular level.”

MCPEval differentiates itself by being a fully automated process, which the researchers claimed allows for rapid evaluation of new MCP tools and servers. It gathers information on how agents interact with tools within an MCP server, generates synthetic data and creates a database for benchmarking agents. Users can choose which MCP servers, and which tools within those servers, to test the agent’s performance on.
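To make that automated loop concrete, the sketch below shows roughly what such a pipeline looks like: enumerate a server's tools, synthesize tasks against them, run the agent while logging every tool call, and score the resulting trajectories. This is an illustrative outline only, not MCPEval's actual API; the data structures, function names and helper logic are hypothetical stand-ins.

```python
# Illustrative sketch of a fully automated, MCP-style evaluation loop.
# NOT MCPEval's real API: the data structures and helpers here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str        # name of the MCP tool the agent invoked
    arguments: dict  # arguments the agent supplied
    result: str      # what the server returned


@dataclass
class Trajectory:
    task: str                                   # natural-language task description
    calls: list[ToolCall] = field(default_factory=list)
    success: bool = False                       # did the verified outcome match?


def generate_tasks(tool_schemas: list[dict], n: int) -> list[str]:
    """Synthesize tasks that exercise the given tools (an LLM would do this in practice)."""
    return [f"Task {i}: exercise {schema['name']}" for i, schema in enumerate(tool_schemas[:n])]


def run_agent(task: str) -> Trajectory:
    """Let the agent under test solve the task via the MCP server, logging every call."""
    traj = Trajectory(task=task)
    # ... agent loop would go here: pick a tool, call the server, append a ToolCall ...
    return traj


def evaluate(tool_schemas: list[dict], n_tasks: int = 20) -> float:
    """Generate tasks, collect trajectories and report a simple success rate."""
    tasks = generate_tasks(tool_schemas, n_tasks)
    trajectories = [run_agent(t) for t in tasks]
    return sum(t.success for t in trajectories) / max(len(trajectories), 1)
```

The point of the sketch is the shape of the loop rather than any particular scoring rule: because every step is automated, the same trajectories that produce the benchmark score can also be kept as training data.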


Shelby Heinecke, senior AI research manager at Salesforce and one of the paper’s authors, told VentureBeat that it is challenging to obtain accurate data on agent performance, particularly for agents in domain-specific roles.

“We’ve gotten to the point where if you look across the tech industry, a lot of us have figured out how to deploy them. We now need to figure out how to evaluate them properly,” Heinecke said. “MCP is a very new idea, a very new paradigm. So, it’s great that agents are gonna have access to tools, but we again need to evaluate the agents on those tools. That’s exactly what MCPEval is all about.”

How it works

MCPEval’s framework follows a task generation, verification and model evaluation design. It leverages multiple large language models (LLMs), so users can evaluate agents with whichever of the models available on the market they are most familiar with.
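One rough way to picture that pluggable, three-stage design is a pipeline whose stages each accept whichever LLM client the user prefers. Again, this is a hypothetical sketch rather than the toolkit's real interfaces; the class and method names are illustrative assumptions.

```python
# Hypothetical sketch of a task-generation -> verification -> evaluation pipeline
# with swappable LLM backends; class and method names are illustrative only.
from typing import Protocol


class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...


class Pipeline:
    def __init__(self, task_llm: LLMClient, verifier_llm: LLMClient, judge_llm: LLMClient):
        # Each stage can use a different model, so users pick ones they know well.
        self.task_llm = task_llm
        self.verifier_llm = verifier_llm
        self.judge_llm = judge_llm

    def generate(self, tool_description: str) -> str:
        return self.task_llm.complete(f"Write a task that requires: {tool_description}")

    def verify(self, task: str) -> bool:
        answer = self.verifier_llm.complete(f"Can this task be solved with the available tools? {task}")
        return "yes" in answer.lower()

    def evaluate(self, task: str, trajectory: str) -> str:
        return self.judge_llm.complete(f"Score this trajectory for task '{task}': {trajectory}")
```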
