Many media executives are betting the future of the industry on artificial intelligence, going so far as to replace journalists in an effort to keep costs down and cash in on the hype.
The result of these efforts so far has left a lot to be desired. We've come across countless examples of publications inadvertently publishing garbled AI slop, infuriating readers and journalists alike.
AI's persistent hallucinations are already infecting large swathes of our online lives, from Google's hilariously terrible AI Overviews mangling trustworthy information to brainrot gambling content appearing in newspapers to entire AI slop farms that blatantly rip off real journalists' work.
Worse yet, Google's embrace of the tech is actively hurting the bottom lines of publications by keeping readers — and with them, much-needed membership and display ad revenue — away from the content their AI is monetizing.
Meanwhile, journalists themselves are finding out the hard way that AI is woefully inadequate at meaningfully helping them out in their day-to-day work.
As a team led by award-winning New York University journalism professor Hilke Schellmann found in a new investigation published by the Columbia Journalism Review, AI is strikingly terrible at summarizing documents and scientific research for busy reporters who might be tempted to rely on the tech.
Schellmann and her colleagues created a new test to evaluate the "journalistic values of accuracy and truth," finding that most currently available AI models, including Google's Gemini 2.5 Pro and OpenAI's GPT-4o (which remains available to paying customers after OpenAI scrapped plans to pull it down following the release of GPT-5), successfully generated short summaries of meeting transcripts and minutes from local government meetings with "almost no hallucinations."
However, the AIs systematically "underperformed against the human benchmark in generating accurate long summaries" of around 500 words, failing to include roughly half the facts included in the transcripts and minutes. Hallucinations were also a bigger issue in the long summaries than the short ones.
The tech's shortcomings were far more egregious when it came to conducting research on behalf of science reporters. The team tasked five top AI research tools with generating a list of related scientific papers for four academic papers, with results that ranged from "underwhelming" to "alarming."
"None of the tools produced literature reviews with significant overlap to the benchmark papers, except for one test with Semantic Scholar, where it matched about 50 percent of citations," Schellman wrote. "Across all four tests, most tools identified less than 6 percent of the same papers cited in the human-authored reviews, and often 0 percent."