Drowning in data sets? Here’s how to cut them down to size

Why This Matters

The exponential growth of data generated by advanced telescopes like SKAO highlights the urgent need for innovative data management and storage solutions in the tech industry. Efficiently handling massive datasets is crucial for scientific progress, cost management, and energy conservation, impacting both researchers and consumers relying on data-driven technologies.

Key Takeaways

Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.
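
As a rough sanity check on those figures, here is a minimal sketch of the arithmetic. The disc capacity (~4.7 GB for a single-layer DVD) and thickness (~1.2 mm) are standard values assumed here, not taken from the article:

```python
# Back-of-the-envelope check of the DVD comparison.
# Assumed (not stated in the article): a single-layer DVD holds ~4.7 GB
# and a disc is ~1.2 mm thick.
annual_data_pb = 700      # SKAO's expected yearly output, in petabytes
dvd_capacity_gb = 4.7     # assumed capacity of one disc, in gigabytes
disc_thickness_mm = 1.2   # assumed thickness of one disc, in millimetres

dvds = annual_data_pb * 1e6 / dvd_capacity_gb   # 1 PB = 1,000,000 GB
stack_km = dvds * disc_thickness_mm / 1e6       # millimetres -> kilometres

print(f"~{dvds / 1e6:.0f} million DVDs")        # ~149 million
print(f"stack ~{stack_km:.0f} km high")         # ~179 km, i.e. nearly 180 km
```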

The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029. These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy’s long-standing questions, such as what dark matter is and how galaxies form.

But 700 petabytes is only about 1% of the data that the array could generate. Shari Breen, head of science operations at the SKAO in Jodrell Bank, UK, estimates that it could produce some 60 exabytes — 60,000 petabytes — each year if researchers used all of its systems continuously and retained all of the data.
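
The "about 1%" figure follows directly from those two numbers; a minimal check:

```python
# 700 PB retained per year, out of a potential 60 exabytes (60,000 PB).
retained_pb = 700
potential_pb = 60_000
print(f"{retained_pb / potential_pb:.1%}")  # 1.2% -- roughly the 1% quoted
```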

“The amount of money that it would take to hold our rawest forms of data is insane — I don’t even know where we would fit that many computers,” says Breen. “So, we have to make some compromises.”

Disciplines such as astronomy and the Earth and biological sciences have long grappled with unwieldy data sets. As the volume, processing speed and variety of data continue to grow, storage capacity is struggling to keep pace. At the same time, the boom in machine-learning and artificial-intelligence technologies is creating an incentive to hoard information. But unconstrained data retention is not financially viable, and it consumes a great deal of energy.

“This is a problem that libraries have been dealing with for as long as libraries have existed,” says Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena. “We cannot physically collect all the books that we want to collect, and in 50 years, the book may not be useful any more.”

Data sets, she says, are the same. “There has to be some curation that determines what is worth keeping and what is worth throwing away.”

Field-specific rules

There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.
