
Softmax, can you derive the Jacobian? And should you care?

Why This Matters

Understanding the softmax function and its Jacobian deepens your insight into how neural networks, especially language models, convert raw scores into probability distributions. That knowledge pays off when developing and optimizing models, enabling more precise adjustments and better interpretability; for end users, it translates into more reliable and accurate AI-driven predictions and applications.

Key Takeaways

Multiclass output? Softmax. Normalising probabilities? Softmax. Attention weights? Softmax. Partition function? You guessed it, Softmax. This function comes up everywhere, but how often have you really thought about what's going on inside?

What does softmax actually do to your distribution?

The softmax function is deceptively simple:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically the result is a pseudo-probability distribution (the values aren't derived from a probability space), but it's close enough to a probability distribution that for practical purposes it works just fine.
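A minimal NumPy sketch of the formula above (the max-subtraction is a standard numerical-stability trick, not part of the mathematical definition; a constant shift of the logits cancels in the ratio, so the output is unchanged):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over the last axis of x."""
    # Subtract the max before exponentiating to avoid overflow;
    # shifting every logit by a constant cancels in the ratio.
    z = x - np.max(x, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```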

One useful way to think about softmax is that it maps vectors into a very specific geometric object: the probability simplex. For an n-dimensional output, this is the set of all vectors where each entry is non-negative and everything sums to 1. In 3 dimensions, this looks like a triangle sitting in 3D space; in higher dimensions, it's the same idea generalised. Softmax takes an unconstrained vector in $\mathbb{R}^n$ and smoothly projects it onto this simplex. The constraint that all outputs must sum to 1 is exactly what creates the interactions between dimensions that we'll see later in the Jacobian.
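To make the simplex picture concrete, here's a small check (using SciPy's `softmax` for variety): whatever unconstrained real vector goes in, the output always lands on the simplex, with non-negative entries summing to 1.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.normal(scale=5.0, size=4)  # unconstrained vector in R^n
    p = softmax(x)
    # On the simplex: every entry non-negative, entries sum to 1.
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
    print(x.round(2), "->", p.round(3))
```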

Let's visualize what this actually does in a real language model scenario - predicting the next token after "the cat sat on a":

[Figure: Distribution shift. Left: raw logit values for candidate tokens; right: probabilities after softmax. The highest logit ("mat" at 3.2) is dramatically amplified to 48% probability, while the others are suppressed. The transformation turns unbounded scores into a probability distribution that sums to 1.]

The transformation is pretty dramatic. The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates — it's very opinionated about which class should win.
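A quick sketch of that amplification for the "the cat sat on a" example. Only the "mat" logit of 3.2 appears in the figure above; the other tokens and logit values here are illustrative choices that happen to roughly reproduce the 48% from the figure:

```python
import numpy as np
from scipy.special import softmax

tokens = ["mat", "floor", "chair", "table", "dog"]  # candidates (assumed)
logits = np.array([3.2, 2.5, 2.0, 1.5, 0.9])        # only "mat" = 3.2 is from the figure
probs = softmax(logits)
for t, l, p in zip(tokens, logits, probs):
    print(f"{t:>5}: logit {l:4.1f} -> prob {p:5.1%}")
# Roughly:
#   mat: logit  3.2 -> prob 48.1%
# floor: logit  2.5 -> prob 23.9%
# chair: logit  2.0 -> prob 14.5%
# table: logit  1.5 -> prob  8.8%
#   dog: logit  0.9 -> prob  4.8%
# "mat" leads "floor" by only 0.7 logits but ends up with twice the
# probability: the exponential exaggerates relative gaps.
```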

We can see this "winner takes most" behavior even more clearly with a batch of vectors:
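For instance, with a few made-up rows of logits (softmax applied independently to each row), a clear leader grabs most of the mass, while near-ties stay spread out:

```python
import numpy as np
from scipy.special import softmax

# Each row is an independent logit vector; softmax acts per row.
batch = np.array([
    [4.0, 1.0, 0.0],   # clear leader
    [2.0, 1.8, 1.6],   # near-tie
    [0.0, 0.0, 0.0],   # identical logits -> uniform output
])
print(softmax(batch, axis=-1).round(3))
# [[0.936 0.047 0.017]   <- winner takes most
#  [0.402 0.329 0.269]   <- near-tie stays spread out
#  [0.333 0.333 0.333]]  <- uniform
```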

... continue reading