Last summer, Peter Degen’s postdoctoral supervisor came to him with an unusual problem: One of his papers was being cited too much. Citations are the currency of academia, but there was something strange about these. Published in 2017, the paper had assessed the accuracy of a particular type of statistical analysis of epidemiological data, and over the years it had received a respectable few dozen citations in other research papers. Now it was being referenced every few days, hundreds of times in all, placing it among the most cited papers of the supervisor’s career. Another professor might be thrilled. Degen’s adviser asked him to investigate.
Degen, a postdoctoral researcher at the University of Zurich Center for Reproducible Science and Research Synthesis, found that the citing papers all followed a similar pattern. Like the original, they were analyzing the Global Burden of Disease study, a publicly available dataset compiled by the Institute for Health Metrics and Evaluation at the University of Washington. But they were using the dataset to churn out a seemingly endless supply of predictions: about the future likelihood of stroke among adults over 20 years old, of testicular cancer among young adults, of falls among elderly people in China, of colorectal cancer among people who eat minimal whole grains, of disease X among population Y, and so on.
Searching GitHub for code used to perform this sort of analysis, Degen followed some links and wound up on the Chinese social media site Bilibili, where he discovered a Guangzhou-based company touting tutorials on how to produce publishable research in under two hours using its software tools and AI writing assistance. These studies were not very good: Researchers who analyzed a subset of them, on headaches, found they were rife with errors and misrepresentations. But they were also not as flagrantly wrong as AI-generated papers of the recent past, which makes them harder to filter out.
“It’s a huge burden on the peer-review system, which is already at the limit,” Degen said. “There’s just too many papers being published and there’s not enough peer reviewers, and if the LLMs make it so much easier to mass produce papers, then this will reach a breaking point.”
Optimists about generative AI have high hopes for its ability to produce future scientific breakthroughs — accelerating discovery, eliminating most types of cancer — but the technology is currently undermining one of the pillars of scientific research, inundating editors and reviewers with an endless stream of papers. Paradoxically, the better the technology gets at producing competent papers, the worse the crisis becomes.
For the past decade, academic publishing has been contending with so-called “paper mills,” black-market companies that mass-produce papers and sell authorship slots to academics, doctors, or others who hope to gain a competitive edge by having published research on their resumes. It has been a game of cat and mouse, with publishers — often pressed by science sleuths, researchers who specialize in ferreting out fraudulent research — closing one vulnerability only to have the mills find a new one. Generative AI was a boon to the mills, helping them skirt plagiarism detectors by creating wholly new images and text. Still, the technology’s telltale hallucinations meant that publishers could at least theoretically screen out much of their work. In practice, papers still got through, only to get retracted when sleuths encountered a diagram of a rat with inexplicably gargantuan genitals labeled “testtomcels” or prose sprinkled with “as an AI assistant”s that someone forgot to delete.
But now AI has improved to the point where it can produce convincing papers almost wholesale, allowing desperate academics in need of a publication to mill papers of their own. The result is a deluge of scientific slop that threatens to swamp publishing, peer review, grant making, and the research system as it exists today.
Matt Spick, a lecturer in health and biomedical data analytics at the University of Surrey and an associate editor at Scientific Reports, first noticed the phenomenon when he received three strikingly similar papers analyzing the US National Health and Nutrition Examination Survey (NHANES), another public dataset. He checked Google Scholar and realized that it wasn’t a coincidence: There had been a sudden explosion in papers citing NHANES that all followed a similar formula, each purporting to discover an association between, for example, eating walnuts and cognitive function or drinking skim milk and depression.
“If you’ve got enough computing power, you go through and you measure every single pairwise association, and eventually you find some that haven’t been written on before and you just publish: There is a correlation between this and that,” Spick said. These correlations are often misleading simplifications of phenomena with multiple causes or random statistical flukes. “One was that how many years you spend in education will cause postoperative hernia complications. That is just a random correlation. What am I supposed to do with that? Leave school early so that I won’t get a postoperative hernia complication later?”
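The mechanics are simple enough to demonstrate. The sketch below is a toy illustration, not code from any of the papers in question; the dataset, variable counts, and names are invented. It runs every pairwise correlation test on pure random noise and still turns up dozens of “significant” findings, purely by chance:

```python
# Toy illustration of brute-force association mining. The data here is pure
# random noise, so every "finding" below is a statistical fluke of exactly
# the kind Spick describes. All sizes and names are invented for this sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_subjects, n_variables = 500, 60              # hypothetical survey dimensions
data = rng.normal(size=(n_subjects, n_variables))

n_tests = false_hits = 0
for i in range(n_variables):
    for j in range(i + 1, n_variables):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        n_tests += 1
        if p < 0.05:                           # the conventional significance bar
            false_hits += 1

print(f"{n_tests} pairwise tests, {false_hits} 'significant' correlations")
# With 1,770 tests at p < 0.05, roughly 5 percent (about 88) clear the bar
# by chance, even though no real association exists anywhere in the data.
```

Run against a real dataset like NHANES, the same loop will surface associations no one has written up yet, each one a candidate for a templated paper.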
Over the years, sleuths have developed a variety of methods for detecting inauthentic papers. Some search for “tortured phrases,” instances where someone tried to evade plagiarism detectors by feeding an existing paper through a synonym generator, which often has the effect of turning technical terms like “reinforcement learning” into nonsense like “reinforcement getting to know,” to cite one recent example. Other sleuths track duplicated images, perform network analysis of authors, or check citations for hallucinated publications, a classic sign of LLM use. Spick searches for masses of papers that follow the same template for analyzing public datasets.
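The tortured-phrases screen, in particular, amounts to little more than string matching. Here is a minimal sketch of the idea, not any sleuth’s actual tool; the phrase list is a tiny illustrative sample modeled on documented cases, with only “reinforcement getting to know” taken from this article:

```python
# Toy version of a "tortured phrases" screen, not any sleuth's actual tool.
# "reinforcement getting to know" comes from the article; the other entries
# are illustrative examples modeled on documented tortured-phrase lists.
TORTURED_PHRASES = {
    "reinforcement getting to know": "reinforcement learning",
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
}

def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
    """Return (mangled phrase, likely original term) pairs found in the text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items() if bad in lowered]

abstract = "We apply reinforcement getting to know to train the agent."
print(flag_tortured_phrases(abstract))
# [('reinforcement getting to know', 'reinforcement learning')]
```

Real screens work from curated lists of thousands of such fingerprints, but the principle is the same: once a synonym generator mangles a term, the result becomes an exact, searchable string.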