RNA structure prediction is hard. How much does that matter?

Note: I am not an expert in RNA structure, and am extremely grateful to Connor Stephens, Rishabh Anand, Ramya Rangan, and Chaitanya K. Joshi, all of whom are actual, bona fide experts, for their incredibly detailed comments on earlier drafts of this essay. All mistakes are, of course, mine, and this essay should not be trusted to function as anything more than entertainment. Do your own research!

Introduction

One thing I’ve always wanted to write is ‘a primer on RNA structure modeling’. I know literally nothing about the field, other than that there are a few startups playing in the space, and I have always been curious what exactly they are up to. But the release of AlphaFold 3, which can model RNA alongside proteins, DNA, and small molecules, dampened this desire. If a single model has solved the problem of RNA structure, who cares about the specifics of the field at large?

But while I was in San Francisco a few months back, I happened to chat with Connor Stephens, a machine learning scientist at Atomic AI. You may recognize that startup, since its founder has the distinct honor of having had their PhD work on RNA structure modeling featured on the cover of Science in 2021 for a substantial advance in RNA structure prediction.

But it had long been unclear to me what exactly Atomic AI did in terms of R&D. This isn’t a startup post, so I’m not planning to explain what their therapeutic goals are. What I was curious about was why they continue to have an ML team despite the RNA problem seemingly being solved by AlphaFold 3. So, I posed that question to Connor.

Connor told me something fascinating: not only did AlphaFold 3 not solve the problem of RNA structure prediction, but RNA may be one of the last structure prediction problems to be solved. The rest of the conversation was so incredibly fun that, midway through it, I decided it’d make for a great article.

Why is RNA structure so hard to model?

At face value, the answer is pretty simple: experimentally determined RNA structures deposited in public repositories are both ridiculously few in number and of much lower quality than you’d naively expect. A quote from a paper best explains this:

There is a huge disparity in protein and RNA data. Even if there is a higher proportion of RNAs than proteins in the living, this is not reflected in the available data: only a small amount of 3D RNA structures are known. Up to June 2024, 7,759 RNA structures were deposited in the Protein Data Bank (34), compared to 216,212 protein structures. The quality and diversity of data are also different: a huge proportion of RNAs come from the same families. It implies several redundant structures that could prevent a model from being generalized to other families. In addition, a huge amount of RNA families have not yet solved structures in the PDB. This means there is no balanced and representative proportion of RNA families through the known structures.

The obvious follow-up question is: why? Apparently, RNA is a good fit for basically none of the existing structure determination methods. But again, why?
