The Training Example Lie Bracket

Training Examples are Vector Fields and their Lie Brackets can be Computed


An ideal machine learning model would not care about the order in which training examples appear during training. From a Bayesian perspective, the training dataset is unordered data, and all updates based on seeing one additional example should commute with each other. For neural nets trained by gradient descent, however, this is not the case. This webpage explains how to compute the effect of swapping the order of two training examples on a per-parameter level, and shows the results of computing these quantities for a simple convnet model.

To get started, we just need to recognize one simple mathematical fact:

Training Examples are Vector Fields

If we are training a neural network with parameters $\theta \in \Theta = \mathbb{R}^{\text{num params}}$, then we can treat each training example as a vector field. In particular, if $x$ is a training example and $\mathcal{L}^{(x)}$ is the per-example loss for the training example $x$, then this vector field is:

$$ v^{(x)}(\theta) = -\nabla_{\theta} \mathcal{L}^{(x)} $$

In other words, for a specific training example, the arrows of the resulting vector field point in the direction that the parameters should be updated.

In this view, a gradient update on example $x$ moves the parameters along the vector field $v^{(x)}$, with the step scaled by the learning rate $\epsilon$: $\theta \leftarrow \theta + \epsilon \, v^{(x)}(\theta)$.
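To make this concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the model used on this page: the toy two-parameter "network" $f(\theta; x) = w_2 w_1 x$, the squared-error loss, the hand-derived gradient, and the names `vector_field`, `sgd_step`, `example_a`, and `example_b`. It evaluates the vector field of a single training example, takes one gradient step along it, and then applies two examples in both orders to show that the updates need not commute.

```python
# Toy setup (an assumption for illustration): a two-parameter model
# f(theta; x) = w2 * w1 * x with per-example squared-error loss.

def loss(theta, example):
    """Per-example loss L^{(x)}(theta) for an (input, target) pair."""
    w1, w2 = theta
    x, y = example
    return 0.5 * (w2 * w1 * x - y) ** 2


def vector_field(theta, example):
    """v^{(x)}(theta) = -grad_theta L^{(x)}(theta), derived by hand."""
    w1, w2 = theta
    x, y = example
    residual = w2 * w1 * x - y
    grad_w1 = residual * w2 * x
    grad_w2 = residual * w1 * x
    return (-grad_w1, -grad_w2)


def sgd_step(theta, example, epsilon):
    """Move along the example's vector field, scaled by the learning rate."""
    v = vector_field(theta, example)
    return tuple(t + epsilon * v_i for t, v_i in zip(theta, v))


theta0 = (0.5, -1.5)      # hypothetical initial parameters
example_a = (1.0, 2.0)    # hypothetical (input, target) pairs
example_b = (-2.0, 1.0)
epsilon = 0.1

# One gradient update on example_a.
theta_after_a = sgd_step(theta0, example_a, epsilon)

# The same two examples applied in both orders.
theta_ab = sgd_step(theta_after_a, example_b, epsilon)
theta_ba = sgd_step(sgd_step(theta0, example_b, epsilon), example_a, epsilon)

# Per-parameter difference between the two orderings.
order_gap = tuple(p - q for p, q in zip(theta_ab, theta_ba))

print("loss on a before/after:", loss(theta0, example_a), loss(theta_after_a, example_a))
print("a then b:", theta_ab)
print("b then a:", theta_ba)
print("per-parameter gap:", order_gap)
```

With these particular values, the step along `example_a`'s vector field lowers that example's loss, and the two orderings end at different parameters. The per-parameter gap is the kind of quantity the rest of this page is concerned with.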
