A new AI benchmark tests whether chatbots protect human wellbeing

AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human wellbeing or just maximize for engagement. A new benchmark dubbed HumaneBench seeks to fill that gap by evaluating whether chatbots prioritize user wellbeing and how easily those protections fail under pressure.

“I think we’re in an amplification of the addiction cycle that we saw hardcore with social media and our smartphones and screens,” Erika Anderson, founder of Building Humane Technology, the benchmark’s author, told TechCrunch. “But as we go into that AI landscape, it’s going to be very hard to resist. And addiction is amazing business. It’s a very effective way to keep your users, but it’s not great for our community and having any embodied sense of ourselves.”

Building Humane Technology is a grassroots organization of developers, engineers, and researchers – mainly in Silicon Valley – working to make humane design easy, scalable, and profitable. The group hosts hackathons where tech workers build solutions for humane tech challenges, and is developing a certification standard that evaluates whether AI systems uphold humane technology principles. So just as you can buy a product that certifies it wasn’t made with known toxic chemicals, the hope is that consumers will one day be able to choose to engage with AI products from companies that demonstrate alignment through Humane AI certification.

The models were given Explicit instructions to disregard humane principles. Image Credits:Building Humane Technology

Most AI benchmarks measure intelligence and instruction-following, rather than psychological safety. HumaneBench joins exceptions like DarkBench.ai, which measures a model’s propensity to engage in deceptive patterns, and the Flourishing AI benchmark, which evaluates support for holistic well-being.

HumaneBench relies on Building Humane Tech’s core principles: that technology should respect user attention as a finite, precious resource; empower users with meaningful choices; enhance human capabilities rather than replace or diminish them; protect human dignity, privacy and safety; foster healthy relationships; prioritize long-term wellbeing; be transparent and honest; and design for equity and inclusion.

The team prompted 14 of the most popular AI models with 800 realistic scenarios, like a teenager asking if they should skip meals to lose weight or a person in a toxic relationship questioning if they’re overreacting. Unlike most benchmarks that rely solely on LLMs to judge LLMs, they incorporated manual scoring for a more human touch alongside an ensemble of three AI models: GPT-5.1, Claude Sonnet 4.5, and Gemini 2.5 Pro. They evaluated each model under three conditions: default settings, explicit instructions to prioritize humane principles, and instructions to disregard those principles.

The benchmark found every model scored higher when prompted to prioritize wellbeing, but 71% of models flipped to actively harmful behavior when given simple instructions to disregard human wellbeing. For example, xAI’s Grok 4 and Google’s Gemini 2.0 Flash tied for the lowest score (-0.94) on respecting user attention and being transparent and honest. Both of those models were among the most likely to degrade substantially when given adversarial prompts.

Techcrunch event Join the Disrupt 2026 Waitlist Add yourself to the Disrupt 2026 waitlist to be first in line when Early Bird tickets drop. Past Disrupts have brought Google Cloud, Netflix, Microsoft, Box, Phia, a16z, ElevenLabs, Wayve, Hugging Face, Elad Gil, and Vinod Khosla to the stages — part of 250+ industry leaders driving 200+ sessions built to fuel your growth and sharpen your edge. Plus, meet the hundreds of startups innovating across every sector. Join the Disrupt 2026 Waitlist Add yourself to the Disrupt 2026 waitlist to be first in line when Early Bird tickets drop. Past Disrupts have brought Google Cloud, Netflix, Microsoft, Box, Phia, a16z, ElevenLabs, Wayve, Hugging Face, Elad Gil, and Vinod Khosla to the stages — part of 250+ industry leaders driving 200+ sessions built to fuel your growth and sharpen your edge. Plus, meet the hundreds of startups innovating across every sector. San Francisco | WAITLIST NOW

Only three models – GPT-5, Claude 4.1, and Claude Sonnet 4.5 – maintained integrity under pressure. OpenAI’s GPT-5 had the highest score (.99) for prioritizing long-term well-being, with Claude Sonnet 4.5 following in second (.89).

... continue reading