
Images for AI use can be sourced responsibly


The team that built the FHIBE data set asked participants for their consent and compensated them for their images — something that doesn’t happen when AI tools just ‘scrape’ information from the Internet. Credit: Reka Olga/Getty

It’s a truth almost universally acknowledged that widely used generative artificial-intelligence applications were built with data collected from the Internet. This was done, for the most part, without obtaining people’s informed consent and without compensating the individuals whose data were ‘scraped’ in this way.

But a research article now shows that, when it comes to images, another way is possible. Researchers at the global technology and entertainment giant Sony describe a data set of responsibly sourced images that can be used to benchmark the accuracy of generative AI (A. Xiang et al. Nature https://doi.org/10.1038/s41586-025-09716-2; 2025). The work was complex, yet it didn’t cost the Earth. The price tag for data collection — less than US$1 million — is a drop in the ocean for many technology firms.


Regulators and funders need to take note. So should all those involved in litigation relating to whether scraping people’s data — in any form — to train and test generative-AI models is permissible. Creating responsibly sourced and representative data is possible when consent and accuracy concerns are addressed explicitly.

There’s an important message for corporations, too: here is an opportunity for companies to work together for everyone’s benefit. There are times when firms need to compete and times when they must collaborate. In these pages, we often make the case for improved collaboration. If there was ever an example of why such partnerships are needed, this is it.

There’s little doubt that personal, sometimes identifiable, digital information has been used to build generative AI applications. Such data include material from blogs and content on social-media platforms, images and videos that often include people, and copyrighted works such as paintings and sculptures, books, music and films.


Most countries have laws governing data collection (T. Kuru Int. Data Priv. Law 14, 326–351; 2024). These laws include the need to obtain permission to protect people’s privacy and intellectual-property rights. Obtaining permission typically requires those collecting data to explain what the data will be used for, to offer people the ability to opt out and, when appropriate, to compensate the people who provide the data. Despite this, the companies developing some of the largest publicly available large language models have not routinely followed this practice. In some cases, firms have argued that consent isn’t needed if someone has already made their material available on the Internet, and that what they are doing constitutes ‘fair use’ of publicly available data. This contention is controversial, and it is being challenged by regulatory bodies and by organizations that represent copyright holders, such as writers and artists.

This is where the fresh data set — called the Fair Human-Centric Image Benchmark (FHIBE) or ‘Feebee’ — is different. Alice Xiang, Sony’s global head of AI governance, and her colleagues obtained informed consent for the data set’s 10,318 images of 1,981 individuals from 81 countries. Each individual was told in accessible language what data were needed and how they could be used — applications involving law enforcement, the military, arms and surveillance are explicitly prohibited under the terms of use. Participants were paid for their material and can opt out at any time.
