Bots are scraping open data — how should researchers respond?

More than 90% of the Confederation of Open Access Repositories member organizations that responded to a survey encounter bot scraping. Credit: fdmsd8yea/Getty

Should researchers still be posting their data openly online? It’s a question being debated by some researchers now that bots are routinely mining open-access databases and scientific publications to train artificial-intelligence tools — and in some cases analysing and combining data sets to churn out new results and papers faster than humans can.

Some researchers argue that the potential of automated science to be used for scientific ‘good’ — speeding up the discovery of new drug targets, for example — means that open data should remain open. But others point to evidence that bots scraping complex data sets can contribute to low-quality research and AI slop, while also allowing the extraction of sensitive data, including patient information. They argue that new rules and technical systems are needed to restrict bot access to databases.

“It’s a pretty big issue everybody should be thinking about, whether you’re for or against AI,” says Andrea Howard, a psychologist at Carleton University in Ottawa, Canada.

Privacy concerns

What is clear is that AI scraping is common. A survey published in June last year by the Confederation of Open Access Repositories found that more than 90% of the member organizations that responded encounter bot scraping, with most of them seeing abnormally high bot activity at least once a week (ref. 1). Often, that scraping is done to provide training data for AI models. Those data are also being used to produce new research outputs that are generated entirely by artificial-intelligence models.

“The scope and speed of how quickly automated pipelines can exhaust the research questions a data set can answer feels like a big change,” says Miri Forbes, a quantitative psychopathologist at Macquarie University in Sydney, Australia. “It shrinks the space left to work in a given data set.”

Last month, Forbes kicked off a discussion about open data sharing on the social-media platform Bluesky. The responses were divided. “Sharing information freely means ceding control and accepting that it may be used for any purpose, including those I don’t like,” responded one user. “It’s not your data anyway,” posted another.

Other people were less sanguine, pointing to a need for additional safeguards. “As a scientific community we need to solve this. We can’t have people fearing being scooped by AI,” posted one user.
