This is nothing to be proud of, but I have never really studied optimisation in depth. Oh sure, I know my Adam from my AdaGrad and I even used L-BFGS one time, but when people start talking about dual spaces and convergence for L ∞ L^\infty L∞ continuous functions, I tend to glaze over a bit. For some reason I think in my head a lot of it seemed a bit old fashioned, whatever that means; why do I need to learn about optimising constrained convex functions in the gradient-descent-for-everything-wait-actually-just-use-a-pre-trained-model era? Isn’t that what they used to, like, make the cheapest possible diet? Boring!
Well, today on the blog, I’d like to talk a bit about my optimisation blind spot in the context of an interesting problem I have been working on related to protein binder design. The first way you think to do something is rarely the best; here we’re going to discuss a concrete example of that. If you, like me, are not an optimisation expert, then I hope you’ll learn something useful from this post. If you are, then feel free to have a good old laugh at my ignorance!
Setup
Ok so what’s the setup? Let’s say we have a categorical probability distribution with k k k categories where the probability of each category is given by the vector x ∈ R k \mathbf{x}\in\mathbb{R}^k x∈Rk. There are some constraints on x \mathbf{x} x for it to be valid:
the probabilities must be normalised: ∑ i = 1 k x i = 1 \sum^{k}_{i=1} x_i = 1 ∑ i = 1 k x i = 1 , and, the probabilities must be greater than 0: x i ≥ 0 ∀ i ∈ { 1 , ⋯ , k } x_i \geq 0 \ \ \forall i \in \{1, \cdots, k\} x i ≥ 0 ∀ i ∈ { 1 , ⋯ , k } .
We have a non-convex function f f f which takes in our probability vector, and spits out a real number. We want to find the x \mathbf{x} x that minimises this function x ∗ = argmin x ∈ Δ f ( x ) \mathbf{x}^{*} = \text{argmin}_{\mathbf{x}\in\Delta}f(\mathbf{x}) x∗=argminx∈Δf(x). In our setup, we will assume we can compute the gradient ∇ x f
abla_\mathbf{x} f ∇xf, but evaluation of the gradient, and indeed the function itself, is computationally expensive.
Protein related aside This is a simplified version of the problem of hallucination for de-novo binder design where x \mathbf{x} x represents a distribution over k = 20 k=20 k=20 amino acids, and f f f is a folding model, like Alphafold which takes a sequence of L L L of these amino acid distributions, called a position specific scoring matrix (PSSM), as input. We want to find the sequence that gives the best fold (according to some metric like ipSAE). In this case the input to f f f is now a matrix X ∈ R k × L \mathbf{X} \in \mathbb{R}^{k \times L} X∈Rk×L where each column is a probability vector.
A first attempt
When seeing a problem like this, where we need to optimise something with constraints, my first instinct is to try and re-parameterise the problem so that I don’t have to worry about them, and we can just use all the “normal” methods. We basically want to rewrite x \mathbf{x} x as a function of some other parameters with no constraints, then we can just optimise those. In this case, we can write
... continue reading