Drowning in data sets? Here’s how to cut them down to size

Why This Matters

The exponential growth of data generated by advanced telescopes like SKAO highlights the urgent need for innovative data management and storage solutions in the tech industry. Efficiently handling massive datasets is crucial for scientific progress, cost management, and energy conservation, impacting both researchers and consumers relying on data-driven technologies.

Key Takeaways

Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.
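
As a rough sanity check on those figures, here is a minimal sketch of the arithmetic. The disc capacity (~4.7 GB for a single-layer DVD) and thickness (~1.2 mm) are standard values assumed here, not taken from the article:

```python
# Back-of-the-envelope check of the DVD comparison.
# Assumed (not stated in the article): a single-layer DVD holds ~4.7 GB
# and a disc is ~1.2 mm thick.
annual_data_pb = 700      # SKAO's expected yearly output, in petabytes
dvd_capacity_gb = 4.7     # assumed capacity of one disc, in gigabytes
disc_thickness_mm = 1.2   # assumed thickness of one disc, in millimetres

dvds = annual_data_pb * 1e6 / dvd_capacity_gb   # 1 PB = 1,000,000 GB
stack_km = dvds * disc_thickness_mm / 1e6       # millimetres -> kilometres

print(f"~{dvds / 1e6:.0f} million DVDs")        # ~149 million
print(f"stack ~{stack_km:.0f} km high")         # ~179 km, i.e. nearly 180 km
```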

The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029. These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy’s long-standing questions, such as what dark matter is and how galaxies form.

But 700 petabytes is only about 1% of the data that the array could generate. Shari Breen, head of science operations at the SKAO in Jodrell Bank, UK, estimates that it could produce some 60 exabytes — 60,000 petabytes — each year if researchers used all of its systems continuously and retained all of the data.
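
The "about 1%" figure follows directly from those two numbers; a minimal check:

```python
# 700 PB retained per year, out of a potential 60 exabytes (60,000 PB).
retained_pb = 700
potential_pb = 60_000
print(f"{retained_pb / potential_pb:.1%}")  # 1.2% -- roughly the 1% quoted
```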

“The amount of money that it would take to hold our rawest forms of data is insane — I don’t even know where we would fit that many computers,” says Breen. “So, we have to make some compromises.”

Disciplines such as astronomy and the Earth and biological sciences have long grappled with unwieldy data sets. As the volume, processing speed and variety of data continue to grow, storage capacity is struggling to keep pace. At the same time, the boom in machine-learning and artificial-intelligence technologies is creating an incentive to hoard information. But unconstrained data retention is not financially viable, and it consumes a great deal of energy.

“This is a problem that libraries have been dealing with for as long as libraries have existed,” says Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena. “We cannot physically collect all the books that we want to collect, and in 50 years, the book may not be useful any more.”

Data sets, she says, are the same. “There has to be some curation that determines what is worth keeping and what is worth throwing away.”

Field-specific rules

There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.
