Tech News

Large language models are biased — local initiatives are fighting for change


Omar Florez (right) is part of a team training the large language model Latam-GPT. Credit: CENIA

In early 2023, Álvaro Soto was looking for Las doce figuras del mundo, a short story co-written by one of his favourite authors, Jorge Luis Borges. To track down the book in which the story was originally published in 1942, Soto enlisted the help of artificial-intelligence chatbot ChatGPT. But the response did not sit right with Soto. “This is wrong,” he thought.


The book ChatGPT proposed was one that Soto knew, and he was sure that the story — which translates as The Twelve Figures of the World — was not in it. For him, this is just one example of how AI technology often fails to grasp nuance, especially that which depends on cultural context. “These models were not trained with quality data from our region,” says Soto, a computer scientist and director of Chile’s National Center for Artificial Intelligence (CENIA) in Santiago. “When they don’t find specific information, they make it up.”

Like many computer and data scientists, Soto thinks that AI will drive a technological revolution, just as the Internet did three decades ago. As AI-based products flood the global market, it's important to him that no one in Latin America — including those who speak minority languages, such as Rapa Nui, spoken on Chile's Easter Island — is left behind.

Roughly 7,000 languages exist worldwide, but fewer than 5% are meaningfully represented online. Languages spoken by communities in Asia, Africa and the Americas account for about 5,500 of the total. Many of these are spoken by small populations and are at risk of disappearing by the end of the century. English, Spanish and French dominate in these regions as a result of their colonial history. AI systems are built mostly on data in these widely spoken languages, especially English, for which there are plenty of the tools and data needed for natural-language processing.

This imbalance can hinder AI’s global reach. “Many of us live in multicultural societies, and many of our grandparents don’t speak English or type on a computer the way we do,” says Leslie Teo, a data scientist and a director of AI products at AI Singapore, a programme launched to boost the country’s AI capabilities. “If people don’t feel the AI understands them, or if they can’t access it, they won’t benefit from it,” he says.

Developers often say that their chatbot services can converse in a wide variety of languages. But the responses often have telltale signs that their core training was in English. “To deploy AI-driven technologies in our communities, they need to speak a language and context people are comfortable with,” says Mpho Primus, a computational linguist and co-director of the Institute for Artificial Intelligent Systems at the University of Johannesburg in South Africa. For her, that means going further than simply line-by-line translation; AI should also reflect culturally specific knowledge and norms. “How I speak to my mother-in-law is a very different way and use of words than how I speak to my mother,” she says.

Across Africa, Latin America and southeast Asia, researchers are collating data sets tailored to local cultures and languages that can be used to train AI systems. And some are already using these to build models locally. But because the architectures underlying these models are often developed in the United States, it remains to be seen what degree of cultural bias might still be present. “The global push to develop AI is no longer just about computing power or algorithmic breakthroughs,” wrote Primus in Nature Africa in June1. “It’s about who gets to speak and who gets left out in the digital future.”

Fluent, but not local
