
A Theory of Deep Learning

Why This Matters

This article highlights the importance of understanding the foundational principles behind deep learning, emphasizing that current practices often rely on empirical results without a solid theoretical basis. As the field grows more complex with increasing parameters and data, developing a cohesive theory is crucial for advancing reliable and interpretable AI systems.

Key Takeaways

Borges wrote a story about a man named Funes who, after a horseback accident, acquires the ability to perceive and remember everything. Every leaf on every tree. Every ripple on every stream at every moment. He is the perfect empiricist. Infinite data, infinite recall, infinite resolution. And he cannot think. Because thinking, as Borges understood, requires forgetting. Funes could reconstruct entire days from memory but could not understand why the dog at 3:14, seen from the side, should be called the same thing as the dog at 3:15, seen from the front.

I suspect [that Funes] was not very good at thinking. To think is to ignore (or forget) differences, to generalize, to abstract. In the teeming world of Ireneo Funes there was nothing but particulars. Jorge Luis Borges, "Funes the Memorious," in Ficciones (1944).

Later in the story, Borges conjures Locke, who in the seventeenth century postulated an impossible language in which each individual thing, each stone, each bird and each branch, would have its own name. Funes projected an analogous language but discarded it because it seemed too general to him, too ambiguous. Deep learning theory has built Locke's language and is well on its way to Funes'. More parameters. More data. Deeper networks. More compute. Uniform convergence people, optimization people, NTK people, PAC-Bayes people, stability people, mean-field people, all working on the same problem, none of them speaking the same language, each proving bounds that are vacuous under the others' assumptions.

Deep learning today is where chemistry was before Lavoisier: a practice that works, built on a theory that doesn't. Everyone agrees this is a problem. Few believe it is a solvable one. At the Diffusion Group at Stanford, we have been trying for some time to answer a question most of our colleagues consider premature and quixotic: why does deep learning work? We think we have an answer.

But first, to see why the question is hard, start with what classical theory predicts. Classical statistical learning theory posits the bias-variance tradeoff: too simple and you underfit the data, too expressive and you overfit. Deep neural networks are highly expressive and overparameterized, with far more parameters than data points; they can shatter any possible labeling of the data. During training, the network interpolates the training data perfectly, noise included, achieving zero training error. Surely, the test error should be catastrophic. Zhang et al., "Understanding Deep Learning (Still) Requires Rethinking Generalization," Communications of the ACM 64, no. 3 (2021). The original 2017 version demonstrated that standard architectures can memorize random labels, establishing that classical capacity-based explanations of generalization are insufficient. But then, the test error…

is also very low.

This is called benign overfitting. It violates the most basic intuition in statistical learning theory. Bartlett et al., "Benign Overfitting in Linear Regression," PNAS 117, no. 48 (2020). You fit the training data exactly, noise and all, and yet the noise is somehow rendered harmless.
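To make the phenomenon concrete, here is a minimal numpy sketch in the spirit of the linear-regression setting Bartlett et al. analyze. The dimensions, covariance spectrum, and noise level are illustrative choices, not the paper's; the point is only that the minimum-\(\ell_2\)-norm interpolator fits the noisy labels exactly and still predicts far better than the null predictor.

```python
# Minimal sketch of benign overfitting, loosely modelled on the linear-regression
# setting of Bartlett et al. (2020). Dimensions, spectrum, and noise level are
# illustrative, not the paper's; only the qualitative effect matters.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 2000, 5                  # n samples, d features, k signal directions

# Anisotropic features: a few high-variance "signal" directions plus a long,
# low-variance tail that can absorb the label noise harmlessly.
eigvals = np.full(d, 1e-3)
eigvals[:k] = 1.0

w_true = np.zeros(d)
w_true[:k] = 1.0

def sample(m):
    X = rng.normal(size=(m, d)) * np.sqrt(eigvals)
    return X, X @ w_true

X_train, f_train = sample(n)
y_train = f_train + 0.5 * rng.normal(size=n)           # noisy training labels

# Minimum-l2-norm interpolator: w = X^+ y. It fits every training point,
# noise included, exactly.
w_hat = np.linalg.pinv(X_train) @ y_train

X_test, f_test = sample(2000)
train_mse = np.mean((X_train @ w_hat - y_train) ** 2)
test_mse = np.mean((X_test @ w_hat - f_test) ** 2)
null_mse = np.mean(f_test ** 2)                        # error of predicting zero

print(f"train MSE {train_mse:.2e}  (interpolation: essentially zero)")
print(f"test  MSE {test_mse:.3f}  vs. null predictor {null_mse:.3f}")
```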

Plotting test error against model complexity for neural networks doesn't yield the expected U-shaped curve but instead shows double descent: the error falls, rises as the model approaches the interpolation threshold, then falls again once the model is expressive enough to interpolate. Belkin et al., "Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off," PNAS 116, no. 32 (2019). At the exact moment the network gains the capacity to memorize everything, it begins to generalize.
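A toy version of the curve is easy to produce. The sketch below uses random ReLU features as a stand-in for network width; the sizes and noise level are illustrative choices rather than anything from the paper, but the test error typically spikes near the interpolation threshold (number of features roughly equal to number of samples) and falls again well beyond it.

```python
# Toy double-descent sweep with random ReLU features, in the spirit of
# Belkin et al. (2019). Sizes and noise are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, n_test, d = 100, 2000, 10

X = rng.normal(size=(n, d))
X_test = rng.normal(size=(n_test, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.3 * rng.normal(size=n)
y_test = X_test @ w_star

def test_error(p):
    """Fit minimum-norm least squares on p random ReLU features."""
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    phi = np.maximum(X @ W, 0.0)
    phi_test = np.maximum(X_test @ W, 0.0)
    coef, *_ = np.linalg.lstsq(phi, y, rcond=None)     # min-norm solution when p > n
    return np.mean((phi_test @ coef - y_test) ** 2)

for p in [10, 50, 90, 100, 110, 200, 500, 2000, 5000]:
    print(f"p = {p:5d}   test MSE = {test_error(p):10.3f}")
# The error typically rises sharply near p ≈ n (the interpolation threshold)
# and falls again as p grows well past n.
```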

Gradient descent, given infinitely many solutions that interpolate the data, picks ones that generalize (usually low \(\ell_2\)-norm, low nuclear norm, approximately low-rank). This is called implicit bias. Gunasekar et al., "Implicit Regularization in Matrix Factorization," NeurIPS (2017), and Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data," JMLR 19 (2018).
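The cleanest place to see this is underdetermined least squares. In the illustrative sketch below (not a setup taken from the cited papers), plain gradient descent started from zero, with no explicit regularizer, converges to the minimum-\(\ell_2\)-norm interpolant.

```python
# Minimal sketch of implicit bias: on an underdetermined least-squares problem,
# gradient descent initialized at zero converges to the minimum-l2-norm
# interpolant, with no explicit regularizer. The setup is illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                        # zero init keeps the iterates in the row space of X
lr = 0.05
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n    # gradient of (1/2n) ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y     # the minimum-norm interpolator, for reference

print("train residual:           ", np.linalg.norm(X @ w - y))
print("distance to min-norm soln:", np.linalg.norm(w - w_min_norm))
# Among the infinitely many interpolating w, gradient descent lands (numerically)
# on the one with the smallest l2 norm.
```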

Lastly, when the data-generating distribution is highly structured and the network lacks the right inductive bias, the network first memorizes the training set and only much later, hundreds of thousands of steps later, suddenly generalizes. This is grokking. Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177 (2022).
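A compact version of the kind of experiment where grokking is reported looks like the sketch below: modular addition, a small network, weight decay, and far more optimization steps than memorization requires. The architecture and hyperparameters are illustrative stand-ins, not those of Power et al. (who used small transformers), and whether and when the delayed jump in test accuracy appears is sensitive to these choices.

```python
# Illustrative grokking-style setup (not the Power et al. configuration):
# modular addition, a small MLP, AdamW with strong weight decay, and a run
# that continues long after the training split is memorized.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# One-hot encode the two operands and concatenate them.
inputs = torch.cat([nn.functional.one_hot(pairs[:, 0], p),
                    nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()

# Train on half the table; hold out the rest.
perm = torch.randperm(p * p)
train_idx, test_idx = perm[: p * p // 2], perm[p * p // 2:]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(inputs[idx]).argmax(dim=1) == labels[idx]).float().mean().item()

for step in range(1, 100_001):          # deliberately a long, full-batch run
    opt.zero_grad()
    loss = loss_fn(model(inputs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 5000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")
# Typically the training accuracy saturates early while test accuracy sits near
# chance for a long stretch before (sometimes abruptly) improving.
```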
