How AI researchers accidentally discovered that everything they thought about learning was wrong

18 Aug, 2025

The lottery ticket hypothesis explains why massive neural networks succeed despite decades of theory predicting they should fail

A decade ago, suggesting that AI researchers should train neural networks with trillions of parameters would have earned you pitying looks. It violated the most fundamental rule in machine learning: make your model too large, and it becomes a glorified photocopier, memorising training data whilst learning nothing useful.

This wasn't mere convention—it was treated as mathematical law, backed by decades of statistical learning theory. Every textbook showed the same inexorable curve: small models underfit, optimal models generalise, large models catastrophically overfit. End of story.

Yet today, those "impossible" massive models power ChatGPT, decode proteins, and have triggered a global arms race worth hundreds of billions of dollars. What changed wasn't just computing power—it was our understanding of learning itself. The story behind this transformation reveals how the biggest breakthrough in AI emerged from researchers bold enough to ignore their own field's foundational assumptions.

The iron law that ruled machine learning

For decades, one principle governed every learning system: the bias-variance tradeoff. The mathematics was elegant, the logic unassailable. Build a model too simple, and it misses crucial patterns. Build it too complex, and it memorises noise instead of signal.

Picture a student learning arithmetic. Show them thousands of addition problems with answers, and they might learn in two ways. The intelligent approach: grasp the underlying algorithm of carrying digits and place values. The foolish approach: memorise every single example. The second strategy delivers perfect scores on homework but complete failure on the exam.

Neural networks seemed especially vulnerable to this memorisation trap. With millions of parameters, they could easily store entire datasets. Traditional theory predicted these overparameterised networks would behave exactly like the memorising student—flawless on training data, hopeless on anything new.

This understanding shaped everything. Researchers obsessed over architectural tricks, regularisation techniques, and mathematical constraints to squeeze performance from small, carefully controlled models. Scaling up was dismissed as expensive stupidity.

The field's most respected voices reinforced this orthodoxy. "Bigger models just overfit" became the mantra. Conference papers focused on efficiency, not scale. The idea that simply adding more parameters might solve problems was academic heresy.

The heretics who broke the rules

In 2019, a group of researchers committed the ultimate sin: they ignored the warnings and kept scaling anyway. Instead of stopping when their networks achieved perfect training accuracy—the point where theory screamed "danger"—they pushed further into the forbidden zone.

What happened next shattered decades of received learning theory. The models didn't collapse. After an initial stumble where they appeared to memorise their training data, something extraordinary occurred. Performance began improving again. Dramatically.

The phenomenon earned the name "double descent"—first the expected rise in test error as models overfit, then an unexpected second descent as they somehow transcended overfitting entirely.
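The effect is easy to reproduce in miniature. The sketch below is a toy illustration rather than anyone's original experiment: it fits noisy data with random Fourier features and minimum-norm least squares (NumPy assumed; the dataset, widths, and noise level are arbitrary choices), sweeping the model width past the size of the training set.

```python
# Toy double descent with random Fourier features and minimum-norm least squares.
# Everything here (data, widths, noise level) is an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 40, 1000, 0.3

def make_data(n):
    x = rng.uniform(-np.pi, np.pi, size=(n, 1))
    y = np.sin(2 * x).ravel() + noise * rng.normal(size=n)
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

print(" width | mean test MSE")
for width in [5, 10, 20, 30, 40, 50, 80, 160, 640]:
    errors = []
    for _ in range(20):  # average over random feature draws to smooth the curve
        w = rng.normal(scale=2.0, size=(1, width))
        b = rng.uniform(0, 2 * np.pi, size=width)
        phi_tr, phi_te = np.cos(x_tr @ w + b), np.cos(x_te @ w + b)
        # lstsq returns the minimum-norm fit once width > n_train; that implicit
        # preference for "small" solutions is what produces the second descent.
        coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
        errors.append(np.mean((phi_te @ coef - y_te) ** 2))
    print(f"{width:6d} | {np.mean(errors):.3f}")
```

The interpolation threshold sits at a width of 40, where the model has just enough parameters to fit every training point; the error spike near that point and the decline beyond it are the two halves of the double descent curve.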
Mikhail Belkin and his colleagues, who documented this discovery, noted that it "contradicts conventional wisdom derived from bias-variance analysis."

The implications rippled through AI research. OpenAI's subsequent work revealed that these benefits extended across multiple orders of magnitude of model size. Larger models weren't just accumulating more facts—they were developing qualitatively new capabilities, including the ability to learn new tasks from a handful of examples.

Suddenly, the entire field pivoted. Google, Microsoft, Meta, and OpenAI poured billions into building ever-larger models. The GPT series exploded from 117 million parameters to 175 billion. The "bigger is better" philosophy that theory had forbidden became the industry's north star.

But one question haunted every researcher: why did any of this work?

The lottery ticket that saved learning theory

The answer emerged from an unexpected corner: a study of what its authors called neural network "lottery tickets". In 2018, Jonathan Frankle and Michael Carbin at MIT were investigating pruning—removing unnecessary weights after training. Their discovery would provide an elegant resolution to the scaling paradox.

Hidden within every large network they studied, they found "winning tickets"—tiny subnetworks that could match the full network's performance. In some cases they could strip away 96% of the parameters without losing accuracy. The vast majority of every successful network was essentially dead weight.

But here lay the crucial insight: these winning subnetworks only succeeded when trained from their original random starting weights. Reinitialise them with fresh random values, and the same sparse architecture no longer kept up (a minimal sketch of this prune-and-rewind experiment appears at the end of this section).

The lottery ticket hypothesis crystallised: large networks succeed not by learning complex solutions, but by providing more opportunities to find simple ones. Every subset of weights represents a different lottery ticket—a candidate simple solution with its own random initialisation. Most tickets lose, but with billions of tickets, a winner becomes all but inevitable.

During training, the network doesn't search for the perfect architecture. It already contains countless small networks, each with different starting conditions. Training becomes a massive lottery draw, with the best-initialised small network emerging victorious whilst billions of others fade away.

This revelation reconciled empirical success with classical theory. Large models weren't memorising—they were finding elegantly simple solutions hidden in vast parameter spaces. Occam's razor survived intact: the simplest explanation remained best. Scale had simply become a more sophisticated tool for finding those simple explanations.
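Here is the minimal sketch promised above: a toy version of the prune-and-rewind experiment, not Frankle and Carbin's actual setup. The dataset, the network size, the 90% pruning rate, and the training schedule are all illustrative choices, and PyTorch is assumed.

```python
# A toy "lottery ticket" experiment: train a dense network, keep only its
# largest weights, rewind the survivors to their original initialisation,
# and retrain. All settings here are illustrative, not Frankle & Carbin's.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Two overlapping Gaussian blobs as a toy binary classification task.
n = 512
x = torch.cat([torch.randn(n, 2) + 1.0, torch.randn(n, 2) - 1.0])
y = torch.cat([torch.zeros(n, dtype=torch.long), torch.ones(n, dtype=torch.long)])

net = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, 2))
init_state = copy.deepcopy(net.state_dict())  # remember every ticket's starting weights

def train(model, masks, steps=300):
    """Train while forcing masked-out weights to stay at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        with torch.no_grad():
            for layer, mask in masks.items():
                layer.weight.mul_(mask)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        for layer, mask in masks.items():
            layer.weight.mul_(mask)
        return (model(x).argmax(dim=1) == y).float().mean().item()

linear_layers = [m for m in net if isinstance(m, nn.Linear)]

# 1. Train the full, dense network (masks of all ones change nothing).
dense_acc = train(net, {m: torch.ones_like(m.weight) for m in linear_layers})

# 2. Build masks that keep only the largest 10% of weights in each layer.
masks = {}
for layer in linear_layers:
    magnitudes = layer.weight.detach().abs()
    cutoff = magnitudes.flatten().kthvalue(int(0.9 * magnitudes.numel())).values
    masks[layer] = (magnitudes > cutoff).float()

# 3. Rewind the surviving weights to their original initialisation and retrain.
net.load_state_dict(init_state)
ticket_acc = train(net, masks)

# 4. Control: same sparse structure, fresh random weights. On a toy task the
#    gap can be small; the original paper shows it clearly on image benchmarks.
for layer in linear_layers:
    layer.reset_parameters()
random_acc = train(net, masks)

print(f"dense network:                {dense_acc:.3f}")
print(f"ticket, original init (10%):  {ticket_acc:.3f}")
print(f"ticket, random re-init (10%): {random_acc:.3f}")
```

The third run is the one the hypothesis hinges on: the sparse structure alone is not enough; the ticket also needs its original starting values.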
What intelligence actually looks like

The implications transcend artificial intelligence. If learning means finding the simplest model that explains the data, and larger search spaces make simple solutions easier to find, this reframes intelligence itself.

Consider your brain: 86 billion neurons, trillions of connections, massively overparameterised by any measure. Yet you excel at learning from limited examples and generalising to new situations. The lottery ticket hypothesis suggests this neural abundance serves the same purpose—providing vast numbers of potential simple solutions to any problem.

Intelligence isn't about memorising information—it's about finding elegant patterns that explain complex phenomena. Scale provides the computational space needed for this search, not storage for complicated solutions.

The discovery also illuminates scientific progress. For decades, researchers avoided scaling because theory said it wouldn't work. The breakthrough came from empirical courage—testing assumptions rather than accepting them. This pattern echoes throughout science. Continental drift was dismissed until plate tectonics provided the mechanism. Quantum mechanics seemed absurd until the experimental evidence became overwhelming. The most important discoveries often require pushing beyond the boundaries of accepted theory.

Yet the lottery ticket hypothesis doesn't overturn classical learning theory—it reveals that its principles operate in more sophisticated ways than anyone imagined. Simple solutions remain optimal; we have simply discovered a better way to find them.

For AI development, this understanding suggests both promise and limits. Scaling works because larger models provide more lottery tickets, more chances of containing a well-initialised simple solution. But the same mechanism implies natural bounds: once a network is already large enough to contain a good ticket for a task, adding still more tickets buys little, and additional scale yields diminishing returns.

This aligns with expert concerns about the limits of current approaches. Yann LeCun argues that fundamental architectural constraints may prevent language models from achieving true understanding regardless of scale. The lottery ticket mechanism explains present success whilst hinting at future challenges.

The elegant surprise

The accidental discovery that revolutionised AI offers a profound lesson: the universe often holds elegant surprises for those bold enough to test the boundaries of conventional wisdom. Sometimes the deepest insights come not from overturning established principles, but from discovering that they operate in subtler ways than we imagined.

Evolution itself follows similar principles, exploring vast spaces of genetic possibility to find elegant solutions to the problem of survival. The most successful organisms aren't the most complex—they're the most efficiently adapted.

What seemed like a crisis for learning theory became its vindication. The bias-variance tradeoff survived, but we learned that it operates through mechanisms far more subtle than anyone suspected. Large neural networks don't succeed by breaking the rules—they succeed by playing the game at a level we never thought possible.

The researchers who dared to scale beyond theoretical comfort zones didn't just advance AI—they reminded us that empirical reality sometimes holds wisdom that theory hasn't yet grasped. In a field built on mathematical certainty, the most important discovery came from embracing uncertainty itself.