
Dozens of AI disease-prediction models were trained on dubious data

Why This Matters

The discovery of AI disease-prediction models trained on potentially fabricated or dubious data highlights significant risks in medical AI applications. This raises concerns about the reliability of AI tools used in clinical settings, potentially leading to misdiagnoses and inappropriate treatments. Ensuring transparency and data provenance is crucial to safeguard patient safety and maintain trust in AI-driven healthcare solutions.


Credit: Marko Nikolic/Alamy

Dubious data sets are being used to train artificial-intelligence models that are designed to predict people’s risk of stroke and diabetes, researchers report in a preprint¹ on medRxiv. Some of the models seem to have been used in clinical settings, although it’s not clear whether this has led to flawed diagnoses. At least two journals are investigating studies that used these data sets.

Adrian Barnett, a statistician at the Queensland University of Technology in Brisbane, Australia, and his colleagues identified 124 peer-reviewed papers that report training machine-learning models on one of two open-access health data sets, both of which provide little information about where the data came from.

An analysis revealed multiple oddities that would not be expected for data from real people, leading Barnett and his colleagues to suspect that the data could have been fabricated. “It was an enormous surprise to come across something like that,” Barnett says.

At least two of the models have been used in hospitals in Indonesia and Spain. One has also been documented in a medical-device patent application filed in 2024, and two are publicly available web tools that allow people to check their risk level by uploading information about themselves.

“Prediction models trained on provenance-unknown data have no place in clinical decision-making. They are intrinsically unreliable,” says Soumyadeep Bhaumik, a public-health researcher at the George Institute for Global Health in Sydney, Australia. If the tools do not use real-world data, they are likely to make incorrect predictions and lead clinicians to make inappropriate decisions, such as prescribing treatments unnecessarily or not prescribing them when they are needed, he says.

Institutions and funders must insist that researchers disclose the source of data used to train AI models for medical applications, and journals should reject papers that fail this requirement, says Bhaumik. Barnett says that the data sets flagged in the study should now be taken down to prevent further studies from using them.

Data sharing

The two data sets investigated in the study, which has not yet been peer reviewed, were uploaded to Kaggle, a platform that developers can use to access data sets for building machine-learning models.

The first, labelled Stroke Prediction Dataset, was uploaded with the description “11 clinical features for predicting stroke events”. It contains health information from 5,110 people, including data on risk factors such as history of heart disease, marital status, average blood glucose level and body mass index (BMI). But when the researchers plotted the average blood glucose level against participant identifiers, they found several irregularities.
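The kind of check the researchers describe — scanning a clinical variable across participant identifiers for patterns that real populations rarely produce — can be sketched in a few lines. The function below flags runs of identical consecutive values (e.g. many adjacent participants sharing the exact same average blood glucose reading), one of several possible irregularities; the example values and the run-length threshold are illustrative assumptions, not taken from the data set itself.

```python
def suspicious_value_runs(values, min_run=3):
    """Return (start_index, run_length, value) for each run of identical
    consecutive values at least min_run long. Long runs of an exact
    repeated clinical measurement are unexpected in real patient data."""
    runs = []
    i, n = 0, len(values)
    while i < n:
        j = i
        # extend the run while the next value matches
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1
        if j - i + 1 >= min_run:
            runs.append((i, j - i + 1, values[i]))
        i = j + 1
    return runs

# Illustrative glucose readings sorted by participant ID: a block of
# four identical values stands out against normal biological variation.
glucose = [92.1, 110.4, 88.8, 88.8, 88.8, 88.8, 101.3]
print(suspicious_value_runs(glucose))  # [(2, 4, 88.8)]
```

A real audit would combine several such screens (duplicate rows, impossible BMI values, digit-preference patterns); this sketch only shows the simplest one.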
