
‘A serious problem’: peer reviews created using AI can avoid detection


The difficulty of detecting AI-tool use in peer review is proving problematic. Credit: BrianAJackson/iStock via Getty

It’s almost impossible to know whether a peer-review report has been generated by artificial intelligence, according to a study that put AI-detecting tools to the test.

A research team based in China used the Claude 2.0 large language model (LLM), created by Anthropic, an AI company in San Francisco, California, to generate peer-review reports and other types of documentation for 20 published cancer-biology papers from the journal eLife [1]. The journal’s publisher makes papers freely available online as ‘reviewed preprints’, and publishes them alongside their referee reports and the original unedited manuscripts.

The authors fed the original versions into Claude and prompted it to generate referee reports. The team then compared the AI-generated reports with the genuine ones published by eLife.
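The paper does not reproduce the team’s prompts or tooling, but the workflow described above can be approximated with a short script. The sketch below is a hypothetical illustration only, assuming the Anthropic Python SDK; the prompt wording, function name and file handling are this article’s assumptions, not the authors’ actual code.

```python
# Hypothetical sketch: asking an LLM to draft a referee report for a manuscript.
# Assumes the Anthropic Python SDK (`pip install anthropic`) and an API key in the
# ANTHROPIC_API_KEY environment variable; the prompt wording is illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def draft_referee_report(manuscript_text: str) -> str:
    """Ask the model to write a peer-review report for the supplied manuscript."""
    response = client.messages.create(
        model="claude-2.0",   # the model version named in the study
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "You are acting as a peer reviewer for a cancer-biology journal. "
                "Write a referee report for the following manuscript, covering "
                "strengths, weaknesses and a recommendation:\n\n" + manuscript_text
            ),
        }],
    )
    return response.content[0].text


# Example usage, with a manuscript loaded from disk:
# report = draft_referee_report(open("preprint.txt").read())
```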

The AI-written reviews “looked professional, but had no specific, deep feedback”, says Lingxuan Zhu, an oncologist at Southern Medical University in Lianyungang, China, and a co-author of the study. “This made us realize that there was a serious problem.”

The study found that Claude could write plausible citation requests (suggesting papers that authors could add to their reference lists) and convincing rejection recommendations (made when reviewers think a journal should reject a submitted paper). The latter capability raises the risk of journals rejecting good papers, says Zhu. “An editor cannot be an expert in everything. If they receive a very persuasive AI-written negative review, it could easily influence their decision.”

The study also found that the majority of the AI reports fooled the detection tools: ZeroGPT erroneously classified 60% of them as human-written, and GPTZero did so for more than 80%.
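The percentages quoted here are simply the share of AI-written reports that each detector labelled as human. A minimal sketch of that tally, assuming the detector verdicts have already been collected as ‘human’/‘ai’ labels (the data below are placeholders chosen to match the reported rates, not the study’s actual results):

```python
# Hypothetical sketch: computing how often a detector mislabels AI-written reviews
# as human-written. The label lists are placeholders, not the study's data.

def fooled_rate(detector_labels: list[str]) -> float:
    """Fraction of AI-generated reports that a detector classified as human-written."""
    return sum(label == "human" for label in detector_labels) / len(detector_labels)

# Illustrative labels for 20 AI-generated reports:
zerogpt_labels = ["human"] * 12 + ["ai"] * 8   # 60% mislabelled as human
gptzero_labels = ["human"] * 17 + ["ai"] * 3   # 85% mislabelled as human

print(f"ZeroGPT fooled rate: {fooled_rate(zerogpt_labels):.0%}")
print(f"GPTZero fooled rate: {fooled_rate(gptzero_labels):.0%}")
```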

Differing opinions

A growing challenge for journals is the fact that LLMs could be used in many ways to produce a referee report. What is deemed an ‘acceptable’ use of AI also differs depending on whom you ask. In a survey of some 5,000 researchers conducted by Nature earlier this year, 66% of respondents said it wasn’t appropriate to use generative AI to create reviewer reports from scratch. But 57% said it was acceptable to use it to help with peer review by getting it to answer questions about papers.

And although AI-detection tools are improving, they struggle to determine how much of a document has been generated using AI. An analysis published last year of referee reports that were submitted to four computer-science conferences estimated that 17% had been substantially modified by chatbots [2]. It’s not clear, however, whether the referees used AI to improve the reports or to write them entirely.
