Latest Tech News

Stay updated with the latest in technology, AI, cybersecurity, and more

Tau² benchmark: How a prompt rewrite boosted GPT-5-mini by 22%

Now on the front page of Hacker News — join the discussion. In a recent post, we introduced the Tau² benchmark, a framework for benchmarking LLMs. Today we’re sharing a surprising discovery we made while using it: a simple prompt rewrite boosted a small model’s success rate by over 20%. This post is a deep dive on how we found and fixed this performance bottleneck by making subtle changes to agent policies. Benchmarking LLMs with Tau² At the recent OpenAI Summer Update, we saw that the GPT-5 model has made significant strides in agentic tasks. To vali

CrowdStrike and Meta just made evaluating AI security tools easier

ZDNET's key takeaways: AI is both a cybersecurity threat and a solution; benchmarks will test LLMs on real-world cybersecurity tasks; the suite could help developers build better models. Overwhelmed with cybersecurity tool options? A new set of benchmark tests aims to help you evaluate them and find the right ones for you.

Why do browsers throttle JavaScript timers?

Posted August 31, 2025 by Nolan Lawson in performance, Web. Tagged: timers. Even if you’ve been doing JavaScript for a while, you might be surprised to learn that setTimeout(0) is not really setTimeout(0). Instead, it could run 4 milliseconds later:

const start = performance.now()
setTimeout(() => {
  // Likely 4ms
  console.log(performance.now() - start)
}, 0)

Nearly a decade ago when I was on the Microsoft Edge team, it was explained to me that browsers did this to avoid “abuse.” I.
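The clamping behaviour the teaser describes can be modelled as a small pure function. This is an illustrative sketch of the HTML spec's timer rule (timers nested more than 5 levels deep get a 4 ms floor), not actual browser source code:

```javascript
// Sketch of the HTML spec's "timer initialisation steps" clamp:
// once timers are nested more than 5 levels deep, a requested
// delay under 4ms is raised to 4ms. Illustrative model only.
function effectiveTimeout(requestedMs, nestingLevel) {
  // Negative delays are treated as 0 by the spec.
  let timeout = Math.max(0, requestedMs);
  if (nestingLevel > 5 && timeout < 4) {
    timeout = 4;
  }
  return timeout;
}

// A fresh setTimeout(fn, 0) really can fire "immediately"...
console.log(effectiveTimeout(0, 0)); // 0
// ...but the 6th nested timer in a chain gets the 4ms floor.
console.log(effectiveTimeout(0, 6)); // 4
```

Delays of 4 ms or more are never clamped, which is why only tight `setTimeout(0)` loops notice the rule.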

Powerful GPUs or Fast Interconnects: Analyzing Relational Workloads

Authors: Marko Kabić, Bowen Wu, Jonas Dann, Gustavo Alonso Abstract In this study we explore the impact of different combinations of GPU models (RTX3090, A100, H100, GraceHoppers - GH200) and interconnects (PCIe 3.0, PCIe 4.0, PCIe 5.0, and NVLink 4.0) on various relational data analytics workloads (TPC-H, H2O-G, ClickBench). We present MaxBench, a comprehensive framework designed for benchmarking, profiling, and modeling these workloads on GPUs. Beyond delivering detailed performance metrics,

Worried about the Pixel 10 Pro XL benchmark controversy? Here’s why you shouldn’t be

Well, friends, the day has finally come: Google Pixel day. Today is the day that Google will officially reveal the Pixel 10 series after months of leaks, and it’s a day we should all be excited about. Instead, a lot of Pixel fans are upset and complaining. Those complaints stem from a last-minute Pixel 10 Pro XL benchmark leak, supposedly showing just how fast Google’s new Tensor G5 chip is. Assuming the benchmarks are legit, the good news is that the Tensor G5 is faster than the Tensor G4. How

Benchmarking GPT-5 on 400 real-world code reviews

GPT-5 is now available in Qodo’s platform for all free and paid users. Get started today. At Qodo, we believe benchmarks should reflect how developers actually work. That’s why we built the PR Benchmark—a benchmark designed to assess how well language models handle tasks like code review, suggesting improvements, and understanding developer intent. Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during train

Herbie detects inaccurate expressions and finds more accurate replacements

Herbie improving accuracy on the “Hamming” benchmark suite. Longer arrows are better. Each arrow starts at the accuracy of the original expression, and ends at the accuracy of Herbie’s output, in each case on random double-precision inputs. Herbie Project News The Herbie Developers Herbie is developed at UW PLSE, with contributions from a supportive community. The main contributors are Pavel Panchekha, Alex Sanchez-Stern, David Thien, Zachary Tatlock, Jason Qiu, Jack Firth, and James R. Wilc

Do LLMs identify fonts?

Spoiler: not really. dafont.com is a wonderful website that contains a large collection of fonts. It’s more comprehensive and esoteric than Google Fonts. One of its features is a forum where users can ask for help identifying fonts – check out this poor fellow who’s been waiting for over two years and bumped his thread. I thought it would be interesting to see whether an LLM could do this task, so I scraped the forum and set up a benchmark. I implemented this as a live benchmark. By this I mean that

Efficiently Generating a Number in a Range (2018)

The vast majority of my posts about random number generation have focused on looking at the properties of different generation schemes. But, perhaps surprisingly, the performance of your randomized algorithm may hinge not on the generation scheme you chose, but on other factors. In this post (inspired by and building on an excellent recent paper by Daniel Lemire), we'll explore a common source of overhead in random number generation that frequently outweighs PRNG engine performance. Imagine thi
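The overhead in question is the cost of mapping a PRNG's raw output into [0, range) without bias. Here is a hedged sketch of the classic rejection approach (the post and Lemire's paper cover faster variants); `next32` here is a stand-in for any 32-bit generator, not code from the post:

```javascript
// Mapping a uniform 32-bit value into [0, range) with naive `x % range`
// is biased whenever range doesn't divide 2^32: low results come up
// slightly more often. Rejection sampling removes the bias by
// discarding the uneven tail. Illustrative sketch only.
function boundedRandom(range, next32) {
  // Largest multiple of `range` that fits in 32 bits; values at or
  // above it belong to the incomplete, biased tail.
  const limit = 2 ** 32 - (2 ** 32 % range);
  let x;
  do {
    x = next32(); // assumed: uniform integer in [0, 2^32)
  } while (x >= limit);
  return x % range;
}

// With a stubbed generator: 4294967295 falls in the tail for range 3
// (2^32 % 3 === 1), so it is rejected and the next draw is used.
const draws = [4294967295, 7];
console.log(boundedRandom(3, () => draws.shift())); // 1  (7 % 3)
```

The rejection loop almost always exits on the first draw, so the dominant cost in practice is the modulo, which is what the faster multiply-based variants avoid.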

Gemini 2.5 Deep Think

Today, we’re making Deep Think available in the Gemini app to Google AI Ultra subscribers – the latest in a lineup of extremely capable AI tools and features made exclusively available to them. This new release incorporates feedback from early trusted testers and research breakthroughs. It’s a significant improvement over what was first announced at I/O, as measured in terms of key benchmark improvements and trusted tester feedback. It is a variation of the model that recently achieved the gold

Modernising the Amiga at Forty

I'm writing this on the 23rd of July 2025. 40 years ago, the Amiga personal computer was revealed to the world at the Lincoln Centre in New York. Andy Warhol was invited to paint the famous Debbie Harry, live on stage with the Amiga 1000. You can watch the event on YouTube. Things have moved on quite a bit since then. The Amiga was quite probably the first multimedia computer. It was pretty big here in Europe, especially in the UK and Germany in the 80s and 90s. Tod

VC Victor Lazarte is leaving Benchmark to launch his own firm

After two years as a venture capitalist for top shelf firm Benchmark, Victor Lazarte is leaving that company to start his own investing gig, he announced on X on Thursday. Lazarte became known in tech for co-founding mobile gaming company Wildlife Studios, which was last valued at an estimated $3 billion in 2020. Wildlife raised money from a multitude of VCs, including Benchmark. During his two years at Benchmark, Lazarte invested in recruiting and data labeling startup Mercor, AI video intell

A new AI coding challenge just published its first results — and they aren’t pretty

A new AI coding challenge has revealed its first winner — and set a new bar for AI-powered software engineers. On Wednesday at 5 p.m. PT, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: He won with correct

AI coding tools are shifting to a surprising place: The terminal

For years, code-editing tools like Cursor, Windsurf, and GitHub’s Copilot have been the standard for AI-powered software development. But as agentic AI grows more powerful and vibe coding takes off, a subtle shift has changed how AI systems are interacting with software. Instead of working on code, they’re increasingly interacting directly with the shell of whatever system they’re installed in. It’s a significant change in how AI-powered software development happens — and despite the low profil

AI agent benchmarks are broken

Benchmarks are foundational to evaluating the strengths and limitations of AI systems, guiding both research and industry development. As AI agents move from research demos to mission-critical applications, researchers and practitioners are building benchmarks to evaluate their capabilities and limitations. These AI agent benchmarks are significantly more complex than traditional AI benchmarks in task formulation (e.g., often requiring a simulator of realistic scenarios) and evaluation (e.g., no

Former Intel CEO launches a benchmark to measure AI alignment

In Brief After former Intel CEO Pat Gelsinger capped off a 40+ year career at the semiconductor giant in December, many wondered where Gelsinger would go next. On Thursday, the former Intel CEO revealed one piece of his next chapter: trying to ensure AI models support a flourishing humanity. In partnership with a “faith tech” company he first invested in roughly 10 years ago called Gloo, Gelsinger launched a new benchmark — Flourishing AI, or FAI — to test how well AI models align with certain

Koala: A benchmark suite for performance-oriented shell-optimization research

The Koala Benchmark Suite. For issues and ideas, open a GitHub issue. Koala is a benchmark suite aimed at characterizing performance-oriented research targeting the POSIX shell. It consists of 14 sets of real-world shell programs from diverse domains ranging from CI/CD and AI/ML to biology and the humanities. They are accompanied by real inputs that facilitate small- and large-scale performance characterization and varying opportunities f

I recommend this Windows laptop to creatives and pro users - and it's on sale for Prime Day

ZDNET's key takeaways: MSI's new Raider 18 HX is retailing for $4,000. This is, without a doubt, the most powerful laptop I've tested in 2025, due to its Intel Core Ultra 9 CPU and GeForce RTX 5080 GPU. As you can imagine, it is quite expensive and rather heavy. As part of Amazon Prime Day, the MSI Raider 18 HX with the GeForce RTX 5090 graphics card is on sale for $4,300, a $900 discount. 2025 has been a b

Benchmarking Postgres

By Benjamin Dicken | July 1, 2025. Today we launched PlanetScale for Postgres. For the past several months, we've been laser focused on building the best Postgres experience on the planet, performance included. To ensure we met our high standard for database performance, we needed a way to measure and compare other options with a standardized, repeatable, and fair method

Slightly better named character reference tokenization than Chrome, Safari, FF

Published 2025-06-26. Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible. A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I to
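For context on what that tokenizer state has to do: given input like "&notin;", it must find the longest entity name that forms a valid named character reference. A plain trie with longest-prefix matching is one natural baseline; this is a hypothetical sketch with a two-entry table, not the post's actual (more compact, specialized) structure:

```javascript
// Build a character trie over named references ("&not" and "&notin;"
// are both valid names in the HTML table, so the tokenizer must
// prefer the longer match when the input allows it).
function buildTrie(entities) {
  const root = {};
  for (const [name, value] of Object.entries(entities)) {
    let node = root;
    for (const ch of name) node = node[ch] ??= {};
    node.value = value; // mark the end of a valid entity name
  }
  return root;
}

// Walk the input, remembering the last node that ended a valid name.
function longestMatch(trie, input) {
  let node = trie, match = null, len = 0;
  for (let i = 0; i < input.length; i++) {
    node = node[input[i]];
    if (!node) break;
    if ('value' in node) { match = node.value; len = i + 1; }
  }
  return match ? { value: match, length: len } : null;
}

const trie = buildTrie({ '&not': '¬', '&notin;': '∉' });
console.log(longestMatch(trie, '&notin;').value); // ∉
console.log(longestMatch(trie, '&nota').value);   // ¬  (falls back to the shorter name)
```

The interesting engineering problem, and the subject of the post, is packing this lookup into far less memory than a pointer-based trie while keeping the walk fast.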

Can we fix AI’s evaluation crisis?

However, there are a growing number of teams around the world trying to address the AI evaluation crisis. One result is a new benchmark called LiveCodeBench Pro. It draws problems from international algorithmic olympiads—competitions for elite high school and university programmers where participants solve challenging problems without external tools. The top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where hu

A Chinese firm has just launched a constantly changing set of AI benchmarks

Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public. Xbench approached the problem with two different systems. One is similar to traditional benchmarking:

Mistral just updated its open source Small model from 3.1 to 3.2: here’s why

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more French AI darling Mistral is keeping the new releases coming this summer. Just days after announcing its own domestic AI-optimized cloud service Mistral Compute, the well-funded company has released an update to its 24B parameter open source model Mistral Small, jumping from a 3.1 release to 3.2-24B Instruct-2506. The new version builds

Gigabyte Radeon RX 9060 XT Review: Great Value Gaming

It's AMD's turn. After months of $2,000+ GPUs and long discussions of DLSS, we're finally on the red team's turf. AMD's strength historically lies at the budget end of the spectrum, where the majority of gamers are playing at 1080p, and spending $1,000 or less for their entire system. Even though we really recommend splurging on a GPU, that's just not the reality for most folks. An $800 GPU needs $1,200 in other parts, and at that point most people who aren't into PC gaming will start shopping