Accurate predictions of disordered protein ensembles with STARLING

Intrinsically disordered regions (IDRs) are structurally heterogeneous protein regions that are estimated to constitute approximately 30% of eukaryotic proteomes1. Despite lacking a fixed structure, IDRs have key roles in essential cellular processes such as transcription, translation and cell signalling1. Owing to their broad structural heterogeneity, IDRs must be described by a conformational ensemble: a large collection of structurally distinct and interchanging conformations1. However, although IDRs cannot be represented as a single 3D structure, they do still possess sequence-encoded conformational biases, and ensembles can have essential roles in IDR function and may be perturbed in disease2,3,4. Just as structural biology has been instrumental in understanding the molecular basis for folded domain function, there is a growing appreciation that the characterization of IDR ensembles may be important for understanding IDR function5,6.

Various experimental techniques have been applied to interrogate sequence–ensemble relationships7,8,9. Although these report on specific aspects of IDR ensembles, they fall short of providing a holistic description of the distribution of conformers (that is, the 3D coordinates of all residues in the protein across many different conformations, referred to here as a ‘full structural ensemble’). To achieve this, the integration of computational modelling with experimental data has proven an effective route10,11,12,13.

Conceptually, computational models and experiments can be combined in two different ways. One approach involves using physics-based models and reweighting or biasing towards experimental observables14,15. Another involves using experimental data to parameterize transferable force fields, which, in principle, do not require additional reweighting16. Although both approaches have been effective in providing insight into IDR behaviour, they require deep technical expertise to ensure reliable conclusions are drawn, and can also be computationally expensive.

Although recent advances in coarse-grained simulations17,18,19 offer a faster alternative (and both modalities described above can be applied here), even coarse-grained simulations can take hours to achieve sufficient sampling and still require a relatively high level of technical expertise to set up, run and analyse. Recent deep learning predictors trained on coarse-grained simulations have enabled proteome-scale predictions of ensemble-average values, but are limited to the specific observables for which a predictor was trained (for example, radius of gyration (Rg) or end-to-end distance (Re))20,21.
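To make the two observables named above concrete, the following minimal sketch computes the radius of gyration and the end-to-end distance for every conformer in a coarse-grained ensemble. It assumes one bead per residue with equal masses, and it uses random walks as stand-in conformers; the function names and the toy data are illustrative assumptions, not part of any published predictor.

```python
import numpy as np

def radius_of_gyration(coords):
    """Per-conformer radius of gyration.

    coords has shape (n_conformers, n_residues, 3); beads are assumed
    to have equal mass (an assumption of this sketch).
    """
    centered = coords - coords.mean(axis=1, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))

def end_to_end_distance(coords):
    """Per-conformer distance between the first and last bead."""
    return np.linalg.norm(coords[:, -1] - coords[:, 0], axis=1)

# Toy ensemble: 200 random walks of 50 beads stand in for real conformers.
rng = np.random.default_rng(0)
ensemble = np.cumsum(rng.normal(size=(200, 50, 3)), axis=1)

r_g = radius_of_gyration(ensemble)
r_e = end_to_end_distance(ensemble)
print("mean Rg:", r_g.mean(), "mean Re:", r_e.mean())
```

Because a full structural ensemble retains all coordinates, any such observable can be computed after the fact, whereas a predictor trained on a single observable returns only that value.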

Deep learning approaches have transformed protein structure prediction, significantly reducing the barrier to large-scale exploration of sequence–structure relationships22 (Fig. 1a). However, these methods are poorly suited to investigate IDRs23 owing to a reduction in alignment-based conservation in disordered proteins, a paucity of appropriate experimental training data and optimization for an inappropriate objective (predicting a single best structure from sequence, whereas IDRs should be described by a large, conformationally heterogeneous ensemble; Fig. 1b). In short, although we now possess easy-to-use tools for accurately predicting the 3D structure of folded proteins, equivalent tools for rapid and accurate IDR ensemble prediction are lacking.

Fig. 1: STARLING approach and model architecture. a, Deep learning has revolutionized protein structure prediction, with major advances being facilitated by large-scale evolutionary information. b, Structure prediction methods for folded domains are poorly suited for predicting the behaviour of IDRs. These limitations stem from the absence of a native-state structure and because evolutionary information is often poorly captured in multiple sequence alignments (MSAs) of IDRs. c, Generative text-to-image models enable the creation of many unique and independent images consistent with a single input prompt. d, IDR ensemble generation shares many similarities with text-to-image generation; we required many distinct, uncorrelated conformers, all of which are consistent with an input prompt (an amino acid sequence). e, STARLING was trained on approximately 50,000 amino acid sequences at 150 mM ionic strength and approximately 14,000 sequences at 20 mM and 300 mM ionic strength. The sequences simulated at 20 mM and 300 mM ionic strength are a subset of those simulated at 150 mM ionic strength. For each sequence, hundreds of distinct conformers were generated using coarse-grained molecular dynamics simulations, and each conformer was converted into a distance map (an image). Sequences were split into training, testing and validation sets. The MARV cartoon was reproduced from ref. 61, Martin Steinegger. f, STARLING makes use of a VAE to compress distance maps to a latent space, allowing a denoising diffusion model to work in this latent space (‘latent diffusion’). g, The overall architecture of the STARLING model combines a latent-space probabilistic denoising diffusion model with a vision transformer architecture using both convolutions and transformer blocks. Latent-space maps were decoded to real space via the VAE decoder. Finally, distance maps can be reconstructed into 3D coordinates via a parallelized multidimensional scaling approach.
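The two conversions mentioned in the caption, conformer to distance map and distance map back to 3D coordinates, can be illustrated with a short sketch. It uses classical (Torgerson) multidimensional scaling on a single toy conformer; this is a textbook variant chosen for brevity and is not claimed to be the parallelized procedure used in STARLING. The function names and the random-walk conformer are assumptions.

```python
import numpy as np

def distance_map(coords):
    """Pairwise distance matrix (n, n) from bead coordinates (n, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def classical_mds(dist, dim=3):
    """Coordinates consistent with a distance map.

    The result is defined only up to translation, rotation and
    reflection, because a distance map carries no absolute frame.
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]    # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy conformer: a 30-bead random walk (stand-in for a simulated conformer).
rng = np.random.default_rng(1)
conformer = np.cumsum(rng.normal(size=(30, 3)), axis=0)

dmap = distance_map(conformer)
rebuilt = classical_mds(dmap)
max_error = np.abs(distance_map(rebuilt) - dmap).max()
print("largest distance mismatch after the round trip:", max_error)
```

For an exact Euclidean distance map the round trip reproduces all pairwise distances to numerical precision. A generated map need not be exactly Euclidean, in which case the clipped eigenvalues yield a best-fit embedding rather than an exact one.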

Here we addressed these challenges by developing a fast and accurate approach for predicting full coarse-grained disordered protein structural ensembles directly from the amino acid sequence. Our approach leveraged advances in generative modelling, a deep learning technique capable of creating new and original data. However, developing a generative model poses a key challenge: the need for large training datasets. To address this, we performed large-scale coarse-grained simulations to generate full structural ensembles across tens of thousands of natural and synthetic IDRs. The resulting method — STARLING — allowed us to generate structural ensembles directly from sequence in seconds. A major goal in developing STARLING was to avoid hardware barriers. Although STARLING is fast (approximately 35 conformers per second) on GPUs, it can still generate ensembles in minutes on Intel/AMD CPUs and seconds on Apple CPUs.

STARLING-generated ensembles show good agreement with experimental data, enabling de novo exploration of uncharacterized IDRs or aiding in the biophysical interpretation of experimental data. Moreover, here we show proof of concept that STARLING can be used to (1) investigate sequence–ensemble relationships for disordered proteins; (2) explore bound-state conformational ensembles for binary disordered protein complexes; and (3) provide conformation-aware latent representations for IDR characterization, search and design. Together, we propose that the ease of use and speed of STARLING make it a powerful tool for democratizing large-scale exploration of sequence–ensemble relationships in IDRs.

Generative artificial intelligence (AI) has been transformative for text-to-image generation24,25. In text-to-image generative AI, an image is generated by passing a prompt (a short phrase describing the desired image) to a pre-trained deep learning model. The model then generates an image consistent with the prompt, a process referred to as inference. Deep learning models capable of inference must first be trained. For models used in modern text-to-image generative AI, training is not simply memorization but also learning the relationship between prompts and the features in the associated images. As a result, once a model is trained, if the same prompt is reused many times, it will generate many independent images entirely distinct from any of the individual instances observed in training. Of note, despite being different from one another, each generated image should be consistent with the prompt (Fig. 1c).

The mapping of a single text prompt to a collection of different images — each of which is consistent with the input prompt — is precisely the problem that we wished to solve for IDR ensemble generation. In IDR ensemble generation, we wanted to take a text prompt (amino acid sequence) and generate a collection of many distinct and uncorrelated IDR conformations consistent with that prompt (Fig. 1d). Moreover, we wanted this generation process to be fast (seconds) and possible on commodity hardware (laptops and desktops). To achieve this goal, we combined a variational autoencoder (VAE) with a discrete-time denoising-diffusion probabilistic model (DDPM) to create a latent diffusion model25. The resulting method (STARLING) enabled the accurate and rapid prediction of coarse-grained conformational ensembles of IDRs.
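As an illustration of the discrete-time diffusion component, the sketch below implements the standard DDPM forward (noising) step, x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps, applied to a latent array. The linear noise schedule, the number of steps and the latent shape are assumptions chosen for demonstration and are not STARLING's actual settings; the learned denoising network that reverses this process is omitted.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(latent, t, alpha_bar, rng):
    """Sample x_t from x_0 in closed form; also return the noise used."""
    noise = rng.normal(size=latent.shape)
    noisy = np.sqrt(alpha_bar[t]) * latent + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

num_steps = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(num_steps)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

rng = np.random.default_rng(2)
latent = rng.normal(size=(1, 24, 24))  # hypothetical encoded distance map

noisy_early, noise_early = forward_noise(latent, 10, alpha_bar, rng)
noisy_late, noise_late = forward_noise(latent, num_steps - 1, alpha_bar, rng)

corr_early = np.corrcoef(latent.ravel(), noisy_early.ravel())[0, 1]
corr_late = np.corrcoef(latent.ravel(), noisy_late.ravel())[0, 1]
print("correlation with the clean latent:", corr_early, corr_late)
```

During training, a network learns to predict the added noise from the noisy latent, the step index and the conditioning input (here, the sequence). At inference, sampling starts from pure noise and denoises step by step, so repeated runs with the same sequence yield different conformers.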
