When we look at the massive scale of today’s Artificial Intelligence boom, it is easy to forget that the foundations of this trillion-dollar industry were laid down over 30 years ago in Munich.
Today, the world's top tech companies are investing hundreds of billions of dollars into scaling up Large Language Models (LLMs) such as those behind ChatGPT. Yet, outside of a few history buffs or old-school folks in the Machine Learning community, people might not realize that virtually every core building block of these modern systems was published within a span of just a few months back in 1991. Incredibly, they all emerged from a single lab at the Technical University of Munich led by Jürgen Schmidhuber.
Before that year ended, his team had essentially mapped out the modern era of deep learning. They published the very first Transformer variant (see ChatGPT's "T"), introduced the concept of unsupervised pre-training (ChatGPT's "P"), and pioneered neural network distillation. They also introduced deep residual learning, the centerpiece of both LSTMs and ResNets, the most cited AI papers of the 20th and the 21st century, respectively. These four techniques power today's most advanced LLMs. Furthermore, they laid the early groundwork for generative adversarial networks, foundational for "Generative AI."
Jürgen’s contributions have deeply shaped my own thinking over the years, from my time at Google Brain to the recursive self-improvement (RSI) research we are currently pursuing at Sakana AI. I am especially proud to have helped popularize World Models back in 2018, building directly on concepts his lab introduced in the 1990s.
It is amazing to see how well some of these ideas have stood the test of time, scaling up to be fully embraced by the global AI community! For those interested in the real history of deep learning, Jürgen has put together a detailed timeline below of exactly how these seeds were planted in Munich in 1991.
David Ha, June 2026
Jürgen Schmidhuber's 1991 Timeline, with Annotated References
★ 26 March 1991: the first kind of Transformer (see the T in ChatGPT), now called the unnormalized linear Transformer [ULTRA][FWP0-6][WHO10][DLH]. It is the predecessor of the normalized quadratic Transformer [TR1]. ULTRA remains important, in part because of its efficiency: its computational cost scales linearly with input size rather than quadratically.
★ 30 April 1991: Pre-Training for deep neural networks (NNs)—the P in ChatGPT [UN0][UN1][UN2][UN][DLH]. This enabled very deep learning [WHO5].
★ 30 April 1991: Neural network distillation—central to the famous 2025 DeepSeek "Sputnik" and other Large Language Models (LLMs) [UN0][UN1][UN2][WHO9][DLH].
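To make the linear-versus-quadratic point in the 26 March 1991 entry concrete, here is a minimal sketch of unnormalized linear attention written as a running sum of outer products (a "fast weight" matrix). It is an illustration under simplifying assumptions, not the formulation from the cited papers: the learned projections that would produce keys, values, and queries are omitted, and the function names (`linear_attention`, `quadratic_reference`) and toy inputs are hypothetical. The second function computes the same outputs by revisiting every earlier position, which is where the quadratic cost comes from.

```python
def linear_attention(keys, values, queries):
    """Causal attention without softmax, computed with a running fast-weight matrix.

    One pass over the sequence: the work per step depends only on the
    vector sizes, so total cost grows linearly with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # Fast-weight matrix W (d_v x d_k), starts at zero.
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Additive update: W += outer(v, k)
        for i, vi in enumerate(v):
            for j, kj in enumerate(k):
                W[i][j] += vi * kj
        # Read out with the current query: y = W q
        outputs.append([sum(w * x for w, x in zip(row, q)) for row in W])
    return outputs


def quadratic_reference(keys, values, queries):
    """Same result, computed by scoring each query against all earlier keys.

    Step t loops over positions 0..t, so total cost grows quadratically.
    """
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for i in range(t + 1):
            score = sum(a * b for a, b in zip(keys[i], q))  # k_i . q_t, no softmax
            for j, vj in enumerate(values[i]):
                y[j] += score * vj
        outputs.append(y)
    return outputs


# Toy inputs (hypothetical values chosen only for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(linear_attention(keys, values, queries))
# → [[1.0, 2.0], [4.0, 6.0], [8.0, 10.0]]
```

Both functions return the same values on these inputs. The difference is only in how the work is organized: the first keeps a fixed-size matrix as its state, while the second rereads the whole history at every step.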