Skip to content
Tech News
← Back to articles

Reflecting to optimise

read original more articles
Why This Matters

This article highlights the importance of understanding optimization techniques beyond basic gradient descent, especially in complex applications like protein binder design. Mastering advanced optimization methods can lead to more effective solutions in the tech industry, particularly in fields requiring precise model tuning and constrained problem-solving.

Key Takeaways

This is nothing to be proud of, but I have never really studied optimisation in depth. Oh sure, I know my Adam from my AdaGrad and I even used L-BFGS one time, but when people start talking about dual spaces and convergence for L ∞ L^\infty L∞ continuous functions, I tend to glaze over a bit. For some reason I think in my head a lot of it seemed a bit old fashioned, whatever that means; why do I need to learn about optimising constrained convex functions in the gradient-descent-for-everything-wait-actually-just-use-a-pre-trained-model era? Isn’t that what they used to, like, make the cheapest possible diet? Boring!

Well, today on the blog, I’d like to talk a bit about my optimisation blind spot in the context of an interesting problem I have been working on related to protein binder design. The first way you think to do something is rarely the best; here we’re going to discuss a concrete example of that. If you, like me, are not an optimisation expert, then I hope you’ll learn something useful from this post. If you are, then feel free to have a good old laugh at my ignorance!

Setup

Ok so what’s the setup? Let’s say we have a categorical probability distribution with k k k categories where the probability of each category is given by the vector x ∈ R k \mathbf{x}\in\mathbb{R}^k x∈Rk. There are some constraints on x \mathbf{x} x for it to be valid:

the probabilities must be normalised: ∑ i = 1 k x i = 1 \sum^{k}_{i=1} x_i = 1 ∑ i = 1 k ​ x i ​ = 1 , and, the probabilities must be greater than 0: x i ≥ 0 ∀ i ∈ { 1 , ⋯ , k } x_i \geq 0 \ \ \forall i \in \{1, \cdots, k\} x i ​ ≥ 0 ∀ i ∈ { 1 , ⋯ , k } .

We have a non-convex function f f f which takes in our probability vector, and spits out a real number. We want to find the x \mathbf{x} x that minimises this function x ∗ = argmin x ∈ Δ f ( x ) \mathbf{x}^{*} = \text{argmin}_{\mathbf{x}\in\Delta}f(\mathbf{x}) x∗=argminx∈Δ​f(x). In our setup, we will assume we can compute the gradient ∇ x f

abla_\mathbf{x} f ∇x​f, but evaluation of the gradient, and indeed the function itself, is computationally expensive.

Protein related aside This is a simplified version of the problem of hallucination for de-novo binder design where x \mathbf{x} x represents a distribution over k = 20 k=20 k=20 amino acids, and f f f is a folding model, like Alphafold which takes a sequence of L L L of these amino acid distributions, called a position specific scoring matrix (PSSM), as input. We want to find the sequence that gives the best fold (according to some metric like ipSAE). In this case the input to f f f is now a matrix X ∈ R k × L \mathbf{X} \in \mathbb{R}^{k \times L} X∈Rk×L where each column is a probability vector.

A first attempt

When seeing a problem like this, where we need to optimise something with constraints, my first instinct is to try and re-parameterise the problem so that I don’t have to worry about them, and we can just use all the “normal” methods. We basically want to rewrite x \mathbf{x} x as a function of some other parameters with no constraints, then we can just optimise those. In this case, we can write

... continue reading