AI systems have recently had a lot of success in one key aspect of biology: the relationship between a protein’s structure and its function. These efforts have included the ability to predict the structure of most proteins and to design proteins structured so that they perform useful functions. But all of these efforts are focused on proteins and the amino acids that build them.
But biology doesn’t generate new proteins at that level. Instead, changes have to take place at the nucleic acid level before eventually making their presence felt at the protein level. And the DNA level is fairly removed from proteins, with lots of critical non-coding sequences, redundancy, and a fair degree of flexibility. It’s not necessarily obvious that learning the organization of a genome would help an AI system figure out how to make functional proteins.
But it now seems that training on bacterial genomes can produce a system that can predict proteins, some of which don’t look like anything we’ve ever seen before.
Training a genome model
The new work was done by a small team at Stanford University. It relies on a feature that’s common in bacterial genomes: the clustering of genes with related functions. Often, bacteria have all the genes needed for a given function—importing and digesting a sugar, synthesizing an amino acid, etc.—right next to each other in the genome. In many cases, all the genes are transcribed into a single, large messenger RNA. This gives the bacteria a simple way to control the activity of entire biochemical pathways at once, boosting the efficiency of bacterial metabolisms.
So the researchers trained what they term a “genomic language model,” which they call Evo, on an enormous collection of bacterial genomes. The training was similar to what you’d see in a large language model: Evo was asked to predict the next base in a sequence and was rewarded when it got it right. It’s also a generative model, in that it can take a prompt and output novel sequences. Those outputs involve a degree of randomness, in the sense that the same prompt can produce a range of different outputs.
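To make the idea of next-base prediction and randomized generation concrete, here is a minimal sketch in Python. It is not Evo and does not reflect its architecture or training data: it swaps in a toy model that simply counts which base tends to follow each short stretch of DNA. The function names, the three-base context, and the add-one smoothing are all assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict

BASES = "ACGT"


def train_next_base_model(sequences, context_len=3):
    """Count which base follows each context of `context_len` bases.

    A toy stand-in for next-base training; a real genomic language
    model learns far longer-range patterns than this.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(context_len, len(seq)):
            counts[seq[i - context_len:i]][seq[i]] += 1
    return counts


def next_base_probs(counts, context):
    """Estimate a probability for each base following `context`.

    Add-one smoothing (an assumption) keeps unseen bases possible.
    """
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + len(BASES)
    return {b: (seen[b] + 1) / total for b in BASES}


def generate(counts, prompt, n_bases, context_len=3, rng=None):
    """Extend `prompt` by sampling one base at a time.

    Because each base is sampled rather than chosen greedily, the
    same prompt can yield different sequences on different runs.
    """
    rng = rng or random.Random()
    out = prompt
    for _ in range(n_bases):
        probs = next_base_probs(counts, out[-context_len:])
        out += rng.choices(BASES, weights=[probs[b] for b in BASES])[0]
    return out


if __name__ == "__main__":
    # Hypothetical training data: a short repeating sequence.
    model = train_next_base_model(["ACGACGACGACGACG"])
    for _ in range(3):
        print(generate(model, "ACG", 12))
```

Running the script prints three continuations of the same prompt, which will generally differ from one another. Passing a seeded `random.Random` as `rng` makes a run repeatable, which is handy for testing.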