Latest Tech News

Stay updated with the latest in technology, AI, cybersecurity, and more

AI agent benchmarks are broken

Benchmarks are foundational to evaluating the strengths and limitations of AI systems, guiding both research and industry development. As AI agents move from research demos to mission-critical applications, researchers and practitioners are building benchmarks to evaluate their capabilities and limitations. These AI agent benchmarks are significantly more complex than traditional AI benchmarks in task formulation (e.g., often requiring a simulator of realistic scenarios) and evaluation (e.g., no

Former Intel CEO launches a benchmark to measure AI alignment

After former Intel CEO Pat Gelsinger capped off a 40+ year career at the semiconductor giant in December, many wondered where he would go next. On Thursday, Gelsinger revealed one piece of his next chapter: trying to ensure AI models support a flourishing humanity. In partnership with Gloo, a “faith tech” company he first invested in roughly 10 years ago, Gelsinger launched a new benchmark, Flourishing AI (FAI), to test how well AI models align with certain

Koala: A benchmark suite for performance-oriented shell-optimization research

Koala is a benchmark suite aimed at the characterization of performance-oriented research targeting the POSIX shell. It consists of 14 sets of real-world shell programs from diverse domains ranging from CI/CD and AI/ML to biology and the humanities. They are accompanied by real inputs that facilitate small- and large-scale performance characterization and varying opportunities f

Benchmarking Postgres

By Benjamin Dicken | July 1, 2025 Today we launched PlanetScale for Postgres. For the past several months, we've been laser focused on building the best Postgres experience on the planet, performance included. To ensure we met our high standard for database performance, we needed a way to measure and compare other options with a standardized, repeatable, and fair method

Slightly better named character reference tokenization than Chrome, Safari, FF

2025-06-26 Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible. A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I to

Can we fix AI’s evaluation crisis?

However, there are a growing number of teams around the world trying to address the AI evaluation crisis. One result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads—competitions for elite high school and university programmers where participants solve challenging problems without external tools. The top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where hu

Mistral just updated its open source Small model from 3.1 to 3.2: here’s why

French AI darling Mistral is keeping the new releases coming this summer. Just days after announcing its own domestic AI-optimized cloud service Mistral Compute, the well-funded company has released an update to its 24B parameter open source model Mistral Small, jumping from a 3.1 release to 3.2-24B Instruct-2506. The new version builds

Gigabyte Radeon RX 9060 XT Review: Great Value Gaming

It's AMD's turn. After months of $2,000+ GPUs and long discussions of DLSS, we're finally on the red team's turf. AMD's strength historically lies at the budget end of the spectrum, where the majority of gamers are playing at 1080p, and spending $1,000 or less for their entire system. Even though we really recommend splurging on a GPU, that's just not the reality for most folks. An $800 GPU needs $1,200 in other parts, and at that point most people who aren't into PC gaming will start shopping