Scientific datasets are riddled with copy-paste errors

Why This Matters

This discovery highlights the critical need for rigorous data validation in scientific research, especially as open-access datasets become more prevalent. Copy-paste errors in influential studies can undermine trust, delay progress, and potentially lead to false conclusions in the tech and medical industries. Ensuring data integrity is essential for advancing reliable scientific knowledge and maintaining public confidence.

Key Takeaways

The above data comes from a landmark paper in Parkinson's Disease research, which provided the first-ever evidence that the disease originates in the gut rather than the brain. The paper received media coverage from major outlets and has amassed over 3000 citations from other scientific papers. But the underlying data contains sequences of duplicated values that should belong to completely different individual mice. The dataset has been publicly available on Dryad - an open-access repository where scientists upload their raw data - for more than 8 years. Why didn't anyone notice the blatant copy-paste errors until now?

Before going into more detail about this case, let me give some background on how we detected the issue. It was flagged by a piece of software I started building last year, inspired by two cases of data fabrication that made the news in recent years: one by Nobel laureate Thomas Südhof's lab and one by spider ecologist Jonathan Pruitt. Both cases had publicly available datasets containing entire blocks of copy-pasted data that seemed quite trivial to detect. I was curious what I could dig up by creating a program that would correctly flag those cases and then unleashing it on all datasets available in open-access repositories.
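
As an illustration only (this is not the detection software itself, whose internals are not described here), a minimal Python sketch of one way to flag copy-pasted blocks is to look for long runs of identical consecutive values shared by two columns that should be independent. The function names, the min_run threshold, and the toy numbers below are all assumptions invented for the example.

```python
from itertools import combinations


def longest_shared_run(a, b):
    """Find the longest contiguous run of identical values shared by a and b.

    Returns (length, start_in_a, start_in_b) with 0-based start indices,
    or (0, 0, 0) if the two sequences share no value at all.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous item of a
    for i, x in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended just before both positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


def flag_duplicate_runs(columns, min_run=4):
    """Compare every pair of columns and report suspiciously long shared runs.

    columns: dict mapping a (hypothetical) group name to a list of values.
    min_run: assumed threshold; shorter shared runs are ignored.
    """
    flags = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        length, start_a, start_b = longest_shared_run(a, b)
        if length >= min_run:
            flags.append((name_a, name_b, length, start_a, start_b))
    return flags


# Toy data, made up for this example: group_b repeats four values of group_a.
groups = {
    "group_a": [12.1, 9.8, 14.3, 11.0, 10.6, 13.2],
    "group_b": [8.4, 9.8, 14.3, 11.0, 10.6, 7.9],
    "group_c": [10.2, 11.7, 9.1, 12.8, 10.0, 11.4],
}
flags = flag_duplicate_runs(groups)
print(flags)  # [('group_a', 'group_b', 4, 1, 1)]
```

A check like this is only a first filter. Measurements recorded with few decimals or few distinct values (counts, scores, rounded times) can share short runs by chance, so any threshold would need tuning per dataset and every hit would still need human review.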

Together with a few volunteer contributors, we have now reported every case found in the first 600 datasets we scanned. Of these, 18 seemed serious enough to raise concerns. Here are 3 of the most exciting ones:

Case 1: A seminal Parkinson's paper

We took mice that were genetically predisposed to developing symptoms of Parkinson's and we just cleared out their microbiome - all their symptoms went away.

That's how the paper's senior author Sarkis Mazmanian summarized the findings of the study when he went on the Rich Roll YouTube channel.

These are the results he is referring to:

Graphs B, C, and D contain measurements of the motor function of different groups of mice. They tell a clean story: the ASO mice (genetically predisposed to developing the mouse model of Parkinson's disease) take longer to complete the tasks than regular wild-type (WT) mice only when they have retained a normal gut microbiome (SPF, Ex-GF), not when they have been stripped of their microbiome (GF, Abx).

Diagram showing which results are affected by each block of duplicate values.

There are two issues with the data:

... continue reading