The adoption of interoperability standards, such as the Model Context Protocol (MCP), can give enterprises insight into how agents and models function beyond their own walled gardens. However, many benchmarks fail to capture real-life interactions with MCP servers.
Salesforce AI Research has developed a new open-source benchmark, MCP-Universe, which tracks LLMs as they interact with MCP servers in the real world, arguing that this paints a better picture of how models perform in real time against the tools enterprises actually use. In initial testing, it found that even strong models, such as OpenAI's recently released GPT-5, still fall short in these real-life scenarios.
“Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper.
MCP-Universe measures model performance on tool usage, multi-turn tool calls, long context windows, and large tool spaces. It is grounded in existing MCP servers with access to actual data sources and environments.
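To make that concrete, below is a minimal sketch of the interaction pattern MCP-Universe exercises: a client connects to a live MCP server, discovers its tools, and executes a model-chosen call. It assumes the official MCP Python SDK (the `mcp` package); the GitHub server command is only an example, and `choose_tool_call`, standing in for the LLM under test, is a hypothetical placeholder.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Example only: launch a GitHub MCP server over stdio.
# Any MCP server with real data access fits this pattern.
server = StdioServerParameters(
    command="npx", args=["-y", "@modelcontextprotocol/server-github"]
)


def choose_tool_call(tools, task):
    """Hypothetical placeholder for the LLM under test: given the server's
    tool schemas and a task, return a (tool_name, arguments) pair. A benchmark
    like MCP-Universe would route this through the model's function-calling
    interface instead of hardcoding it."""
    raise NotImplementedError


async def run_task(task: str) -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the server's tools; large tool spaces are one of the
            # axes the benchmark stresses.
            tools = (await session.list_tools()).tools
            name, arguments = choose_tool_call(tools, task)
            # Execute the model-chosen call against the live server.
            result = await session.call_tool(name, arguments=arguments)
            print(result.content)


asyncio.run(run_task("Find the most-starred MCP repositories"))
```

In a real evaluation, this choose/call/observe loop repeats across many turns, which is precisely the multi-turn, long-context, large-tool-space behavior the benchmark is designed to stress.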
Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.”
“Two of the biggest are long context challenges, where models can lose track of information or struggle to reason consistently when handling very long or complex inputs,” Li said. “And unknown tool challenges, where models often aren’t able to seamlessly use unfamiliar tools or systems in the way humans can adapt on the fly. This is why it’s crucial not to take a DIY approach with a single model to power agents alone, but instead to rely on a platform that combines data context, enhanced reasoning, and trust guardrails to truly meet the needs of enterprise AI.”