
Who Invented Backpropagation?


Efficient backpropagation (BP) is central to the ongoing Neural Network (NN) ReNNaissance and "Deep Learning." Who invented it?

BP's modern version (also called the reverse mode of automatic differentiation) was first published in 1970 by Finnish master's student Seppo Linnainmaa [BP1] [R7]. In 2020, we celebrated BP's half-century anniversary! A precursor of BP was published by Henry J. Kelley in 1960 [BPA]—in 2020, we celebrated its 60-year anniversary.

In the 2020s, it was still easy to find misleading accounts of BP's history [HIN][T22][DLP][NOB]. I had a look at the original papers from the 1960s and 70s, and talked to BP pioneers. Here is a summary based on my award-winning 2014 survey [DL1] which includes most of the references mentioned below.

The minimisation of errors through gradient descent (Cauchy, 1847 [GD']; Hadamard, 1908 [GD'']) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s, e.g., Kelley (1960) [BPA]; Bryson (1961) [BPB]; Pontryagin et al. (1961); Dreyfus (1962) [BPC]; Wilkinson (1965); Tsypkin (1966) [GDa-b]; Amari (1967-68) [GD2,GD2a]; Bryson and Ho (1969); initially within the framework of Euler-Lagrange equations in the Calculus of Variations, e.g., Euler (1744).

Steepest descent in the weight space of such systems can be performed (Kelley, 1960 [BPA]; Bryson, 1961 [BPB]) by iterating the chain rule (Leibniz, 1676 [LEI07-10][DLH]; L'Hopital, 1696) in Dynamic Programming style (DP, e.g., Bellman, 1957 [BEL53]). A simplified derivation (Dreyfus, 1962 [BPC]) of this backpropagation method uses only the Leibniz chain rule [LEI07].

The systems of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, without explicitly addressing either direct links across several layers or potential additional efficiency gains due to network sparsity.
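To make this layer-by-layer formulation concrete, here is a minimal NumPy sketch (my own illustration, not code from the 1960s papers): the derivative of a scalar cost is pulled back through a chain of differentiable stages by multiplying with each stage's transposed Jacobian, from the last stage back to the first. The stage function, shapes, and cost are assumptions chosen for brevity.

```python
# Illustrative sketch: backward accumulation of derivatives through a chain of
# stages via their Jacobians ("iterating the chain rule"). Not from the
# original 1960s papers; stage functions and shapes are assumptions.
import numpy as np

def stage(x, W):
    """One differentiable stage: y = tanh(W x)."""
    return np.tanh(W @ x)

def stage_jacobian(x, W):
    """Jacobian dy/dx of the stage above."""
    pre = W @ x
    return np.diag(1.0 - np.tanh(pre) ** 2) @ W

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) for _ in range(3)]   # three stages
x = rng.standard_normal(4)

# Forward pass: store each stage's input for the Jacobian computation.
inputs = [x]
for W in Ws:
    inputs.append(stage(inputs[-1], W))

# Scalar cost: half the squared norm of the final output.
y = inputs[-1]
grad = y.copy()                       # dC/dy for C = 0.5 * ||y||^2

# Backward pass: multiply the running gradient by each stage's transposed
# Jacobian, from the last stage back to the first.
for W, x_in in zip(reversed(Ws), reversed(inputs[:-1])):
    grad = stage_jacobian(x_in, W).T @ grad

print("dC/dx =", grad)                # gradient of the cost w.r.t. the input
```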

Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NN-like networks was first described in a 1970 master's thesis (Linnainmaa, 1970, 1976) [BP1][R7], albeit without reference to NNs. This kind of BP is also known as the reverse mode of automatic differentiation (e.g., Griewank, 2012 [BP5]), where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early BP FORTRAN code (Linnainmaa, 1970) [BP1] and closely related but slightly later work by Ostrovskii et al. (1971) [BP1a] (apparently the first journal publication on backpropagation). As of 2020, all modern software packages for NNs (such as Google's TensorFlow) are based on Linnainmaa's method of 1970.
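The following is a hedged, tape-based sketch of the reverse mode of automatic differentiation (a toy illustration, not Linnainmaa's FORTRAN code). Each operation records its local partial derivatives during the forward pass; a single backward sweep then accumulates gradients for all inputs, visiting only the edges that actually exist in the (possibly sparse) graph, so the backward cost is of the same order as the forward cost.

```python
# Toy reverse-mode automatic differentiation. Class and function names are my
# own assumptions for illustration.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # tuples (parent_var, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(output):
    """One reverse sweep over the recorded graph, in topological order."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    output.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad   # chain rule accumulation

# Example: f(x, y) = x * y + x
x, y = Var(3.0), Var(4.0)
f = x * y + x
backward(f)
print(x.grad, y.grad)   # 5.0 (= y + 1) and 3.0 (= x)
```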

BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). This was followed by some preliminary, NN-specific discussion (Werbos, 1974, section 5.5.1) and a computer program for automatically deriving and implementing BP in differentiable systems (Speelpenning, 1980). The first NN-specific application of efficient BP as above was apparently described by Werbos in 1982 [BP2] (but not yet in his 1974 thesis, as is sometimes claimed).

However, already in 1967, Amari suggested training deep multilayer perceptrons (MLPs) with many layers in a non-incremental, end-to-end fashion from scratch by stochastic gradient descent (SGD) [GD1], a method proposed in 1951 [STO51-52]. Amari's implementation [GD2,GD2a] (with his student Saito) learned internal representations in a five-layer MLP with two modifiable layers [DLH][NOB], which was trained to classify non-linearly separable pattern classes. Back then, compute was billions of times more expensive than today.
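As an illustration of this training regime (not Amari's 1967 setup), the sketch below applies plain end-to-end SGD to a tiny MLP with two modifiable weight layers on a non-linearly separable task; the architecture, learning rate, and data are assumptions.

```python
# Minimal end-to-end SGD sketch on XOR, a non-linearly separable problem.
# Illustrative only; not the historical implementation.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0., 1., 1., 0.])                       # XOR targets

W1 = rng.standard_normal((2, 4))                     # first modifiable layer
W2 = rng.standard_normal(4)                          # second modifiable layer
lr = 0.2

for step in range(10000):
    i = rng.integers(len(X))                         # stochastic: one sample
    h = np.tanh(X[i] @ W1)                           # hidden representation
    y = h @ W2                                       # output
    err = y - t[i]
    # Gradients of 0.5 * err^2 via the chain rule, then a plain SGD update.
    grad_W2 = err * h
    grad_W1 = np.outer(X[i], err * (1 - h ** 2) * W2)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(np.round(np.tanh(X @ W1) @ W2, 2))             # should approximate 0, 1, 1, 0
```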

Compare this to the first deep learning MLPs, called GMDH networks (Ivakhnenko and Lapa, since 1965), whose layers are incrementally grown and trained by regression analysis [DEEP1-2][R8][DLH][NOB]. These were actually the first deep NNs that learned to create hierarchical, distributed, internal representations of incoming data.
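For contrast, here is a rough sketch in the GMDH spirit (an illustration, not Ivakhnenko and Lapa's procedure): layers are grown one at a time, each candidate unit is a least-squares polynomial fit of a pair of signals from the previous layer, and only the best-scoring units are kept. Real GMDH selects units with an external criterion on held-out data; the unit form, selection rule, and toy data here are simplified assumptions.

```python
# Incremental, layer-by-layer growth with regression-trained units, in the
# GMDH spirit. Simplified: selection uses training error rather than a
# separate validation set.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]            # toy target

def fit_unit(a, b, y):
    """Least-squares quadratic unit of two inputs; returns predictions and MSE."""
    F = np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    pred = F @ coef
    return pred, np.mean((pred - y) ** 2)

layer = [X[:, j] for j in range(X.shape[1])]      # first layer: raw inputs
for depth in range(3):                            # grow three layers
    candidates = [fit_unit(layer[i], layer[j], y)
                  for i, j in combinations(range(len(layer)), 2)]
    candidates.sort(key=lambda c: c[1])           # keep the best-scoring units
    layer = [pred for pred, _ in candidates[:5]]
    print(f"layer {depth + 1}: best MSE = {candidates[0][1]:.4f}")
```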
