
Sampling at negative temperature


Summary: Inspired by the definition of temperature in statistical mechanics and the possibility for it to be below zero, we try sampling LLaMA at $T = -0.001$. The results are maximally weird.

The notion of temperature comes from statistical mechanics. Consider a system that has states with energies $E_1, \dots, E_n$. If the system is in thermal equilibrium, the probability distribution over states is given by the Boltzmann distribution:

$$p_i = \frac{e^{-E_i/k_B T}}{\sum\limits_i e^{-E_i/k_B T}}$$

The distribution is parameterized by a single number, the temperature $T$. At lower temperatures the lowest-energy states predominate; at higher temperatures there is a more even mix.
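To make those two limits concrete, here is a minimal NumPy sketch of the Boltzmann distribution (the three-state system and the choice $k_B = 1$ are illustrative assumptions, not from the article):

```python
import numpy as np

def boltzmann(energies, T, k_B=1.0):
    """Boltzmann distribution over states with the given energies at temperature T."""
    # Shift by the minimum energy for numerical stability; the shift cancels in the ratio.
    w = np.exp(-(energies - energies.min()) / (k_B * T))
    return w / w.sum()

energies = np.array([0.0, 1.0, 2.0])              # a hypothetical three-state system
print(np.round(boltzmann(energies, T=0.1), 3))    # low T: almost all weight on the lowest-energy state
print(np.round(boltzmann(energies, T=100.0), 3))  # high T: a nearly uniform mix over all states
```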

At the last layer of a neural net, we apply the softmax function to the neuron activations $\{z_i\}$ to get a vector of probabilities that sum to 1:

$$p_i = \frac{e^{z_i/T}}{\sum\limits_i e^{z_i/T}}$$

Wait — this is just the Boltzmann distribution, up to a constant![1]

[1] There's no minus sign in the exponent because, while higher-energy states are less likely, tokens with larger logits are more likely.
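Written as code, the temperature-scaled softmax is a one-liner. A hedged sketch (the function name and the max-shift for numerical stability are mine, not from the article):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Softmax over logits z_i at temperature T: the Boltzmann distribution with E_i = -z_i and k_B = 1."""
    # Shift by the max logit before exponentiating; the shift cancels in the ratio
    # and keeps exp() from overflowing when T > 0.
    w = np.exp((logits - logits.max()) / T)
    return w / w.sum()
```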

In a language model, the temperature controls how creative the generated text is. For instance, in the zero temperature limit, the model should deterministically generate the most likely token. In the infinite temperature limit, all tokens are equally likely and the model output will be random noise. For an interactive explanation, see here.
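To illustrate those two limits, we can sample from the softmax sketched above with some toy logits (the logit values are made up for the example):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # Same temperature-scaled softmax as in the sketch above.
    w = np.exp((logits - logits.max()) / T)
    return w / w.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])     # hypothetical next-token logits
for T in (0.01, 1.0, 100.0):
    p = softmax_with_temperature(logits, T)
    token = rng.choice(len(logits), p=p)     # draw one token id from the distribution
    print(f"T={T}: p={np.round(p, 3)}, sampled token {token}")
# As T -> 0 the distribution collapses onto the argmax token (greedy decoding);
# as T -> infinity it approaches uniform, i.e. random noise.
```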

What would it mean to have a temperature that is below zero? (This isn't the same as the negative Fahrenheit or Celsius temperatures we get on a cold day in Vermont — I mean below zero on an absolute scale like Kelvin).
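Before reading on, we can at least see what the formula does with a small negative $T$: dividing by a negative temperature reverses the ordering of the logits, so the probability mass piles onto the least likely token. A minimal sketch (toy logits again; the sign-aware shift is only for numerical stability and is my own addition):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # Shift by the max logit when T > 0 and by the min logit when T < 0,
    # so the exponents stay non-positive; the shift cancels in the ratio.
    shift = logits.max() if T > 0 else logits.min()
    w = np.exp((logits - shift) / T)
    return w / w.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])                      # hypothetical next-token logits
print(np.round(softmax_with_temperature(logits,  0.001), 3))  # ~[1, 0, 0, 0]: greedy argmax
print(np.round(softmax_with_temperature(logits, -0.001), 3))  # ~[0, 0, 0, 1]: the least likely token wins
```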
