
AI reviewers are here — we are not ready


Preprint sites have long served as the agile speedboat to the lumbering tanker of conventional academic publishing, and that agility allows for bold experimentation. The latest experiment from openRxiv, the non-profit organization in New York City that runs the bioRxiv and medRxiv repositories, is perhaps its most provocative yet.

Last month, openRxiv announced that it was integrating a reviewing tool driven by artificial intelligence into its preprint sites. The tool, from the start-up company q.e.d Science in Tel Aviv, Israel, offers rapid AI-generated feedback (typically within 30 minutes) on biomedical manuscripts — judging originality, identifying logical gaps and suggesting more experiments and tweaks to the text.

The allure of an AI reviewer is undeniable. For any scientist who has languished for months awaiting a decision, or decoded a snarky remark from a hostile ‘Reviewer #2’, an algorithmic alternative sounds like the efficiency upgrade that publishing desperately needs. Large language models (LLMs) can provide feedback in seconds, potentially without conflicts of interest. But there is a big difference between a process that is efficient and one that is valid. As the scientific community starts to embrace AI technology, it must make sure it isn’t solving a logistical problem by creating an intellectual one.

Peer review has two purposes. It must validate the majority of routine scientific work — careful studies that test predictions and fill gaps in understanding — by scrutinizing statistics, methods and logical coherence. It must also recognize the rare discovery that introduces anomalous findings or challenges established frameworks, by assessing not whether the rules were followed, but whether they still apply.

Humans can, at least in principle, perform both functions. AI might not. LLMs can check statistics, catch plagiarism and verify citations; this contribution alone could be transformative. If routine work is offloaded to a computer, human attention — the scarcest resource in science — can be reserved for what matters most. But LLMs have limits, too. Trust an AI reviewer beyond those guard rails, and it might quickly become a liability.

The first problem is one of regression to the mean. Human peer review can be an exercise in statistical sampling in which (usually) three specialists provide distinct data points, with the editor attempting to converge them into a decision. AI collapses this process: rather than sampling, it outputs the average reviewer’s assessment. A 2024 study that used GPT-4 to generate comments on papers confirmed that the LLM was excellent at predicting what the average reviewer would say (W. Liang et al. NEJM AI https://doi.org/g88s5h; 2024). But there’s more to this than finding the average.
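To make the sampling point concrete, here is a minimal, purely illustrative Python sketch. It is not drawn from the study or from openRxiv’s tool; the score distribution, the share of dissenting specialists and the objection threshold are all invented for illustration. It contrasts an average-reporting “AI verdict” with repeated three-reviewer panels: the average barely registers a small dissenting minority, whereas sampling surfaces a strong objection in a meaningful fraction of panels.

```python
# Toy model (illustrative assumptions only): most reviewers score a paper
# near consensus, but a small minority of specialists spot a serious flaw
# and score it far lower. Averaging washes that signal out; sampling a
# three-person panel sometimes puts it in front of the editor.

import random
import statistics

random.seed(42)

def reviewer_scores(n_reviewers: int = 100) -> list[float]:
    """Simulate a population of scores on a 1-10 scale: roughly 95%
    cluster around consensus (mean 6.0), 5% are dissenting specialists
    who score near 2.0. These proportions are invented for illustration."""
    scores = []
    for _ in range(n_reviewers):
        if random.random() < 0.05:
            scores.append(random.gauss(2.0, 0.5))   # dissenting specialist
        else:
            scores.append(random.gauss(6.0, 0.8))   # consensus view
    return scores

population = reviewer_scores()

# The "AI reviewer" in this toy model simply reports the population mean,
# mirroring a model trained to predict the average reviewer's assessment.
ai_verdict = statistics.mean(population)

# The "human process": sample three reviewers per trial and check whether
# the editor sees at least one strong objection (a score below 4.0).
n_trials = 10_000
flagged = 0
for _ in range(n_trials):
    panel = random.sample(population, 3)
    if min(panel) < 4.0:
        flagged += 1

print(f"AI (mean) verdict: {ai_verdict:.2f}  # dissent washed out")
print(f"Panels seeing a strong objection: {flagged / n_trials:.1%}")
```

Under these invented parameters the averaged verdict sits near 5.8, indistinguishable from an uncontroversial paper, while roughly one in seven sampled panels confronts the editor with the dissenting view. That is the information the collapse to a mean discards.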