Multiclass output? Softmax. Normalising probabilities? Softmax. Attention weights? Softmax. Partition function? You guessed it, Softmax. This function comes up everywhere, but how often have you really thought about what's going on inside?
What does softmax actually do to your distribution?
The softmax function is deceptively simple:
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically the result is a pseudo-probability distribution (the values aren't derived from a probability space), but it's close enough to a probability distribution that for practical purposes it works just fine.
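A minimal sketch of this definition in NumPy (the max-subtraction is a standard numerical-stability trick, not part of the mathematical definition — it leaves the output unchanged):

```python
import numpy as np

def softmax(x):
    """Exponentiate each entry and normalize by the sum of exponentials.

    Subtracting the max first avoids overflow for large logits; the
    output is unchanged because softmax is invariant to constant shifts.
    """
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # each entry strictly between 0 and 1
print(probs.sum())  # sums to 1
```

Note that the ordering of the inputs is preserved: the largest logit always maps to the largest probability.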
One useful way to think about softmax is that it maps vectors into a very specific geometric object: the probability simplex. For an n-dimensional output, this is the set of all vectors where each entry is non-negative and everything sums to 1. In 3 dimensions, this looks like a triangle sitting in 3D space; in higher dimensions, it's the same idea generalised. Softmax takes an unconstrained vector in $\mathbb{R}^n$ and smoothly projects it onto this simplex. The constraint that all outputs must sum to 1 is exactly what creates the interactions between dimensions that we'll see later in the Jacobian.
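A quick numerical check of this picture, using an inline softmax for self-containment: outputs always land on the simplex, and the map is many-to-one, since shifting every logit by the same constant hits the same simplex point ($\mathbb{R}^n$ has one more degree of freedom than the $(n-1)$-dimensional simplex).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.5, -0.3, 0.7])
p = softmax(x)

# On the simplex: non-negative entries that sum to 1.
print(np.all(p >= 0), np.isclose(p.sum(), 1.0))  # True True

# Shift invariance: adding a constant to every logit lands on
# the exact same point of the simplex.
print(np.allclose(p, softmax(x + 100.0)))  # True
```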
Let's visualize what this actually does in a real language model scenario - predicting the next token after "the cat sat on a":
*Distribution shift — left: raw logit values for candidate tokens; right: probabilities after softmax. The highest logit ("mat" at 3.2) gets dramatically amplified to 48% probability, while the others are suppressed. The transformation turns unbounded scores into a probability distribution that sums to 1.*
The transformation is pretty dramatic. The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates — it's very opinionated about which class should win.
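One way to see the exaggeration concretely: softmax turns *additive* gaps in logit space into *multiplicative* ratios in probability space. With hypothetical logits (these particular numbers are illustrative, not taken from the figure), a gap of 1.0 between the top two logits always yields a probability ratio of exactly $e \approx 2.72$:

```python
import numpy as np

# Hypothetical logits where the leader is ahead by 1.0.
logits = np.array([3.2, 2.2, 1.0, 0.5])
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()

# The ratio of any two probabilities is e^(difference of their logits),
# so an additive gap of 1.0 becomes a multiplicative factor of e.
print(probs[0] / probs[1])  # e^(3.2 - 2.2) = e ≈ 2.718
```

This is why the largest logit dominates: each extra unit of logit advantage multiplies its lead by another factor of $e$.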
We can see this "winner takes most" behavior even more clearly with a batch of vectors: