Latest Tech News

Stay updated with the latest in technology, AI, cybersecurity, and more

AI coding tools are shifting to a surprising place: The terminal

For years, code-editing tools like Cursor, Windsurf, and GitHub’s Copilot have been the standard for AI-powered software development. But as agentic AI grows more powerful and vibe coding takes off, a subtle shift has changed how AI systems are interacting with software. Instead of working on code, they’re increasingly interacting directly with the shell of whatever system they’re installed in. It’s a significant change in how AI-powered software development happens — and despite the low profil

AI agent benchmarks are broken

Benchmarks are foundational to evaluating the strengths and limitations of AI systems, guiding both research and industry development. As AI agents move from research demos to mission-critical applications, researchers and practitioners are building benchmarks to evaluate their capabilities and limitations. These AI agent benchmarks are significantly more complex than traditional AI benchmarks in task formulation (e.g., often requiring a simulator of realistic scenarios) and evaluation (e.g., no

Former Intel CEO launches a benchmark to measure AI alignment

After former Intel CEO Pat Gelsinger capped off a 40+ year career at the semiconductor giant in December, many wondered where Gelsinger would go next. On Thursday, the former Intel CEO revealed one piece of his next chapter: trying to ensure AI models support a flourishing humanity. In partnership with Gloo, a “faith tech” company he first invested in roughly 10 years ago, Gelsinger launched a new benchmark, Flourishing AI (FAI), to test how well AI models align with certain

Koala: A benchmark suite for performance-oriented shell-optimization research

For issues and ideas, open a GitHub issue. Koala is a benchmark suite aimed at the characterization of performance-oriented research targeting the POSIX shell. It consists of 14 sets of real-world shell programs from diverse domains ranging from CI/CD and AI/ML to biology and the humanities. They are accompanied by real inputs that facilitate small- and large-scale performance characterization and varying opportunities f

I recommend this Windows laptop to creatives and pro users - and it's on sale for Prime Day

ZDNET's key takeaways: MSI's new Raider 18 HX is retailing for $4,000. This is, without a doubt, the most powerful laptop I've tested in 2025, due to its Intel Core Ultra 9 CPU and GeForce RTX 5080 GPU. As you can imagine, it is quite expensive and rather heavy. As part of Amazon Prime Day, the MSI Raider 18 HX with the GeForce RTX 5090 graphics card is on sale for $4,300, a $900 discount. 2025 has been a b

Benchmarking Postgres

By Benjamin Dicken | July 1, 2025. Today we launched PlanetScale for Postgres. For the past several months, we've been laser-focused on building the best Postgres experience on the planet, performance included. To ensure we met our high standard for database performance, we needed a way to measure and compare other options with a standardized, repeatable, and fair method

Slightly better named character reference tokenization than Chrome, Safari, FF

2025-06-26. Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible. A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I to

Can we fix AI’s evaluation crisis?

However, there are a growing number of teams around the world trying to address the AI evaluation crisis. One result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads—competitions for elite high school and university programmers where participants solve challenging problems without external tools. The top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where hu

A Chinese firm has just launched a constantly changing set of AI benchmarks

Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public. Xbench approached the problem with two different systems. One is similar to traditional benchmarking:

Mistral just updated its open source Small model from 3.1 to 3.2: here’s why

French AI darling Mistral is keeping the new releases coming this summer. Just days after announcing its own domestic AI-optimized cloud service Mistral Compute, the well-funded company has released an update to its 24B parameter open source model Mistral Small, jumping from a 3.1 release to 3.2-24B Instruct-2506. The new version builds

Gigabyte Radeon RX 9060 XT Review: Great Value Gaming

It's AMD's turn. After months of $2,000+ GPUs and long discussions of DLSS, we're finally on the red team's turf. AMD's strength historically lies at the budget end of the spectrum, where the majority of gamers are playing at 1080p, and spending $1,000 or less for their entire system. Even though we really recommend splurging on a GPU, that's just not the reality for most folks. An $800 GPU needs $1,200 in other parts, and at that point most people who aren't into PC gaming will start shopping

MiniMax-M1 open-weight, large-scale hybrid-attention reasoning model

1. Model Overview We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. Consistent with MiniMax-Text-01, the M1 model natively supports a context length of 1 million

Chemical knowledge and reasoning of large language models vs. chemist expertise

Benchmark corpus To compile our benchmark corpus, we utilized a broad list of sources (Methods), ranging from completely novel, manually crafted questions over university exams to semi-automatically generated questions based on curated subsets of data in chemical databases. For quality assurance, all questions have been reviewed by at least two scientists in addition to the original curator and automated checks. Importantly, our large pool of questions encompasses a wide range of topics and que

Nvidia Arm chip surfaces with strong Geekbench scores, could rival top Intel and AMD laptop CPUs

Something to look forward to: A mysterious "Nvidia N1x" Arm processor recently appeared on Geekbench with a single-core score of 3,096 and an 18,837 multi-core score. The result provides some of the earliest concrete evidence of Nvidia's long-rumored upcoming SoC. Word is that Nvidia has been developing an Arm-based processor for the past few years. Despite a teaser released last year in collaboration with Dell and the January unveiling of Nvidia's Arm workstation, a consumer-level product has