The Unreasonable Redundancy of Nature's Protein Folds

Over the last few years, deep neural networks have made generative language modeling dramatically more powerful, giving us large language models. A similar leap happened for continuous modalities like images and video. More recently, the same kinds of techniques have been applied to the generative modeling of biomolecules with great success. Models such as DeepMind's AlphaFold3 made it much easier to predict biomolecular interactions, including drug-protein and antibody-protein complexes, and soon afterward people figured out how to repurpose these capabilities to design drug-like molecules. Chai-2, Latent-X2, and Nabla all report developable antibody or biologics designs. In the near future, most antibodies entering the clinic might be designed in large part with deep-learning-based generative models, potentially with superior pharmaceutical properties and against receptors that have resisted wet-lab-based approaches.

How would you improve on these systems? We definitely want better biomolecular modeling so we can put better drugs into the clinic. The recipe for improving a deep learning system has been surprisingly simple at a high level: you scale the model, scale the compute, and scale the data. LLMs are obviously improving by being scaled aggressively. AlphaFold3 was also a major effort to scale the model and data; its training data ranges from experimental structures of known biomolecular complexes, including protein-ligand complexes, to the enormous sequence databases produced by genomics and metagenomics, such as MGnify. Internally, DeepMind called the project "all-PDB" for a while, referring to all the interactions represented in the Protein Data Bank.

The key move in AlphaFold3's scaling recipe was to turn sequence scale into structure scale: use structure prediction to convert large protein sequence databases into predicted 3D structures. Genomics and metagenomics have given us billions of protein sequences, many inferred from environmental DNA collected from organisms that have never been cultured in the lab. For training structure-based design models, though, the useful object is often the 3D structure. Structure prediction models let us convert some of that sequence scale into structural data: take millions of natural sequences, predict the folds they adopt, and use those predicted structures as training examples for the next generation of biomolecular models.

At Ligo, we care about this recipe because we train generative models for designing enzymes. When we tried to scale our structural training data by folding more natural sequences, we ran into a problem: natural protein sequences are vast, but their folds are much more redundant than the sequence counts suggest. This post is about that mismatch, and about why simply folding more natural sequences may not buy as much new structural diversity as we hoped. We will describe data engineering tricks for clustering the known protein universe, and what our results imply about how to think about the enzyme design problem.

Modern biomolecular models rely on sequence scale

Modern structure prediction models rely heavily on multiple sequence alignments. A multiple sequence alignment, or MSA, lines up related versions of a protein from different organisms. When two positions in that alignment tend to change together, it can be a clue that the corresponding residues are close in 3D space or tied together by function. This is coevolution: two positions change in a coordinated way across related proteins. For example, if one position is usually negatively charged and touches a positively charged position, evolution may flip both together while avoiding pairs that would repel each other. My mental model of AlphaFold2 is that it used this kind of coevolutionary signal to constrain the rough geometry of a protein, then learned how to fill in the rest of the structure.
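To make the coevolution idea concrete, here is a minimal sketch that scores how strongly two alignment columns vary together using mutual information. This is one simple, classical covariation measure chosen for illustration; it is not a description of how AlphaFold2 processes MSAs, and real covariation analyses typically add corrections for phylogeny and sampling bias. The toy alignment and the function name are made up for the example.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences. A high value means
    knowing the residue at column i tells you a lot about column j.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 2 form a charge pair that
# flips together (D with K, or K with D); column 1 varies independently
# of column 0; column 3 is fully conserved.
msa = [
    "DAKG",
    "DAKG",
    "KADG",
    "KADG",
    "DGKG",
    "KGDG",
]

coupled = mutual_information(msa, 0, 2)      # 1 bit: perfectly coupled
uncoupled = mutual_information(msa, 0, 1)    # close to 0: independent
conserved = mutual_information(msa, 0, 3)    # close to 0: no variation to share
```

The conserved column is worth noticing: a position that never changes carries no covariation signal at all, which is one reason these methods need deep, diverse alignments to work.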

AlphaFold3 seems to be doing something broader. Its antibody-antigen performance is especially interesting because there is no coevolutionary signal across the interface for an MSA to expose: antibodies and their targets do not share an evolutionary history. To do well there, the model has to learn something about protein surfaces themselves: which shapes, chemistries, and local geometries are likely to be compatible with each other. That is a different kind of signal from residue coevolution within one protein family.

This is where MGnify-scale data may matter. Metagenomic sequence resources expose models to enormous numbers of natural variants, many from organisms we have never cultured. The empirical clue is that models trained with MGnify-scale protein distillation seem to separate most clearly on antibody-antigen prediction, where direct coevolution cannot explain the interaction signal (Supplementary info). That increased coverage of sequence space looks valuable. The question is whether it also comes with comparable diversity in protein folds.

Sequence diversity is not fold diversity
