The supercomputing infrastructure at the University of Tarapacá (UTA) in Arica, Chile, is a fundamental pillar for Latam-GPT. With a projected investment of $10 million, the new center has a cluster of 12 nodes, each equipped with eight state-of-the-art NVIDIA H200 GPUs. This capacity, unprecedented in Chile and the wider region, not only enables large-scale model training in the country for the first time but also promotes decentralization and energy efficiency.
The first version of Latam-GPT will be launched this year. The model will be refined and expanded as new strategic partners join the effort and more robust data sets are integrated into it.
This interview has been edited for length and clarity.
WIRED: Tech giants such as Google, OpenAI, and Anthropic have invested billions in their models. What is the technical and strategic argument for the development of a separate model specifically for Latin America?
Álvaro Soto: Regardless of how powerful these other models may be, they are incapable of encompassing everything relevant to our reality. I feel that today they are too focused on the needs of other parts of the world. Imagine we wanted to use them to modernize the education system in Latin America. If you asked one of these models for an example, it would probably tell you about George Washington.
We should be concerned about our own needs; we cannot wait for others to find the time to ask us what we need. Given that these are new and very disruptive technologies, there is room and a need for us, in our region, to take advantage of their benefits and understand their risks. Having this experience is essential to guiding the technology down the best path.
This also opens up possibilities for our researchers. Today, Latin American academics have few opportunities to interact in depth with these models. It is as if we wanted to study magnetic resonance imaging but didn't have an MRI machine. Latam-GPT seeks to be that fundamental tool so that the scientific community can experiment and advance.
The key input is data. What is the status of the Latam-GPT corpus, and how are you addressing the challenge of including not only variants of Spanish and Portuguese, but also indigenous languages?
We have put a lot of emphasis on generating high-quality data. It’s not just about volume, but also composition. We analyze regional diversity to ensure that the data does not come disproportionately from just one country, but that there is a balanced representation. If we notice that Nicaragua is underrepresented in the data, for example, we’ll actively seek out collaborators there.
We also analyze the diversity of topics, from politics and sports to art and other areas, to keep the corpus balanced. And, of course, there is cultural diversity. In this first version, we have focused on including cultural information about our ancestral peoples, such as the Aztecs and the Incas, rather than on the languages themselves. In the future, the idea is to also incorporate indigenous languages. At CENIA, we are already working on translators for Mapuche and Rapanui, and other groups in the region are doing the same with Guaraní. It is a clear example of something we have to do ourselves, because no one else will.