An idiot's guide to lead optimisation for proteins

Or, understanding the Cradle-1 pipeline.

Lead optimisation is the step in drug design where you take a molecule that sort of works and try to make it actually work, and it’s arguably the step where most real-world design campaigns succeed or die. Due to the influence of a couple of my pals, I have recently become interested in using machine learning to do lead optimisation for proteins. Said pals have been kind enough to indulge my extremely beginner level questioning over the few weeks. I’m going to use this post to share what they have taught me, in the hope it might in turn help you understand a bit more about this fascinating area. Like any field, there are some established principles that are never spelled out explicitly in the literature, which can make it confusing for newcomers. We’re going to try to understand how we could actually build a real system for lead optimisation but studying one that has been shown to work well in real life.

Firstly, before we get into it, what actually is a protein? I think to answer this in full we would need multiple textbooks/degrees. Since basically all of my knowledge of biology comes from one read through of Philip Ball’s How Life Works I am convinced the answer to any question about biology is arbitrarily arcane and complex, so for now it will suffice to say that proteins are a class of molecule which are an integral part of basically all the processes that sustain life. Proteins are chains of smaller molecules called amino acids of which there are 20 different types. As such we can represent a protein as a string of characters, with a character standing for a different amino acid, using all the letters of the alphabet apart from BJOUXZ. For example, we can write myoglobin, a protein responsible for shuttling oxygen around our cells, as:

MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG

For certain combinations of amino acids, this chain folds up into a fixed shape, which allows that protein to perform a function. Predicting the shape of the fold from its sequence is, to put it mildly, highly non-trivial, and is the problem the AlphaFold-2 model “solved” to some extent to win the Nobel Prize in Chemistry. It is important to understand that there are many possible combinations of amino acids, and most of them will not fold into a regular, predictable shape, and do not do anything useful. The folded myoglobin looks something like this:

The structure of myoglobin. Public domain, via Wikimedia Commons.

Typically proteins consist of about 300 amino acids, but the largest, which is comically named PKZILLA, contains upwards of 40,000, and they can be as small as 20. There are estimated to be between 80,000 and 400,000 different proteins that perform functions in human cells. Note that when the amino acids are bonded to form the protein each one is known as a residue. The proteins that exist in nature are the result of evolution. Because of this we can group proteins into families which have some similarity to each other; within these families a large proportion of the sequence representing each protein will be the same. So now we very roughly know what a protein is, what is protein design?

The ultimate goal of protein design is to generate new molecules that perform a certain function. This function can range from catalysing a specific chemical reaction to binding to a disease causing molecule. Lead optimisation is one of the most important steps in the design process. We assume that we have an existing template molecule that is functional to some extent, but is not sufficient for our ultimate goal. This molecule might have been the result of a previous failed design campaign, generated de novo (i.e. from scratch) by some other model, or chosen from nature. The process of lead optimisation involves proposing changes to this initial molecule with the ultimate goal of improving its properties for the task at hand. In practice, the function of the protein for this task can rarely be summarised by a single number and we usually care about several properties at once and these can trade off against each other. For simplicity, we’ll mostly talk as if we’re optimising a single property in this post, but keep in mind that the real picture is multi-objective.

The process of lead optimisation proceeds by proposing a set of candidate changes, testing them in the lab, and then proposing more changes after integrating the results of the lab tests. The way this traditionally worked was by using directed evolution, which essentially involves introducing random mutations, testing them and keeping the ones that improve function and mutating them further in a loop until we get to sufficient performance. Given that we have massive databases of proteins, and the awesome power of deep learning, these days we can probably do better. And indeed some people have!

Cradle is a bio-tech startup that sells a system for ML-based protein lead optimisation. They’re somewhat unusual for a bio-ML company in operating their own wet lab, which they claim gives them a unique ability to keep the loop between model suggestions and experimental feedback tight. Their system seems to be a market leader, and they are working with a number of the world’s largest pharmaceutical companies (e.g. Novo Nordisk, Bayer, J&J). They have demonstrated impressive results across a number of different contexts. Cradle recently published a white-paper, describing how their system works at a high level. Here it is:

... continue reading