Déjà Tune
Most of my reading this week focused on fine-tuning, sparked by Thinking Machines Lab’s announcement of Tinker. The six-month-old, already $12B-valued startup founded by OpenAI’s former CTO Mira Murati wants to bring fine-tuning back into the spotlight with a fine-tuning-as-a-service platform positioned as a foundation for research collaborations with universities.
A few days later, Clément Delangue from Hugging Face posted that he sensed a paradigm shift toward self-managed, open-source, and specialized LLM deployments, pointing to dedicated hardware like NVIDIA’s DGX Spark, to his many conversations with founders about growing client demand, and to a16z’s Personal AI Workstation, a very clever marketing stunt (I’m jealous).
All of this feels like déjà vu. For a brief moment after the first wave of large language models, fine-tuning was the hottest topic in machine learning. Then, just as quickly, it disappeared from most production systems, and fine-tuned models now account for less than 10% of AI inference workloads.
So, how did fine-tuning get sidelined so fast, and why does it feel like it could be time for a comeback? And more importantly, what could be different this time?
Attention, Please
Before the Transformer breakthrough, the spark that led to the LLMs we use today, NLP relied on specialized models. Early progress came from recurrent architectures like RNNs and LSTMs, which for the first time learned directly from word sequences instead of relying on hand‑crafted linguistic features. A step forward in representations, but without the pretraining-and-transfer paradigm that would define later foundation models: each application had to start from scratch on task-specific data.
In 2017, Google’s Attention Is All You Need introduced the Transformer architecture, replacing recurrence and convolution with self‑attention alone. Seven months later, ULMFiT demonstrated that a pretrained language model (at the time still based on LSTMs) could be fine‑tuned for different tasks, and helped establish the methodological foundation that made Transformers practically useful. And a year later, models like BERT and GPT‑1 applied the design in practice.
BERT leveraged the encoder side with bidirectional attention for understanding, while GPT used the decoder side with unidirectional (left-to-right) attention for generation, beautifully illustrated in The Illustrated BERT, ELMo, and Co and The Illustrated GPT-2, two readings recommended by Cédric Deltheil (Finegrain).
BERT in particular reshaped the culture of NLP: instead of building every model from scratch, researchers could “just fine-tune” a pretrained Transformer and achieve results that once required months of manual feature engineering.
Everything was fine until LLM madness turned into a full-blown parameter explosion, with models jumping from millions to hundreds of billions of parameters (and then some). Fine-tuning suddenly looked like a much less smart idea. What had once been a matter of a few GPU-hours became a massive industrial operation with serious deployment challenges: the standard practice, called Full Fine-Tuning (FFT), meant retraining every layer and every weight. It offered precision, but at a brutal cost.
Then, in 2021, Microsoft Research introduced LoRA (Low-Rank Adaptation of Large Language Models). Instead of retraining billions of parameters, LoRA freezes the original weights and adds small low-rank matrices to selected layers. Only these are trained, cutting costs by an order of magnitude while matching (and sometimes even exceeding) FFT performance. Unsurprisingly, LoRA became the default, and by 2024, thanks to Hugging Face’s PEFT library, implementing it was practically a one-liner.
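To give a sense of how low the barrier has become, here is a minimal sketch with PEFT; the base model and hyperparameter values are purely illustrative, not recommendations.

```python
# Minimal LoRA setup with Hugging Face PEFT.
# Base model and hyperparameters are illustrative, not recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections, as in the original paper
    task_type="CAUSAL_LM",
)

# The original weights stay frozen; only the small low-rank matrices are trained.
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the total
```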
Finding The Right Tune
That surface-level simplicity masks an ugly truth: fine-tuning is more than a package to deploy and maintain. The tuning itself is where the real magic happens, and it is never a one-config-fits-all process. Hyperparameter tuning alone can make or break a model. The process feels more like alchemy than science, balancing ranks, learning rates, and alpha ratios while hoping your adapters don’t overfit or forget what the model already knows (a phenomenon called catastrophic forgetting). And when you finally get something that works, evaluating it feels less like validation and more like divination.
Meanwhile, LLMs kept getting better at nearly every task, approaching something close to omniscience. Writing a leasing contract? A whole chapter on the industrial economy of rural Britain in the nineteenth century? A movie script? A REST API or a website for your mom? By 2023, most teams realized they could achieve about 90% of fine-tuned performance through prompt engineering (helped by ever-larger context windows) or RAG (Retrieval-Augmented Generation, which gives the model access to an external knowledge base). Both approaches required no retraining and came with far less operational burden, for quite decent results.
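For readers who have never looked under the hood, the RAG recipe really is that simple. Here is a schematic sketch, where embed() and generate() are hypothetical stand-ins for whatever embedding model and LLM endpoint you already use.

```python
# Schematic RAG: retrieve relevant passages and prepend them to the prompt,
# so no model weights are ever updated. embed() and generate() are
# hypothetical placeholders for your embedding model and LLM endpoint.
import numpy as np

def retrieve(query, docs, doc_vecs, embed, k=3):
    """Return the k documents whose embeddings are most similar to the query."""
    q = embed(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-scores)[:k]]

def answer(query, docs, doc_vecs, embed, generate):
    """Stuff the retrieved passages into the prompt and let the model answer."""
    context = "\n\n".join(retrieve(query, docs, doc_vecs, embed))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```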
Tuning Back In
But why does fine-tuning seem to be gaining fans again? Maybe because the very factors that once made it irrelevant or inefficient are now, piece by piece, being addressed.
GPU-as-a-service platforms such as Together.ai let you spin up a LoRA fine-tuning pipeline with minimal friction, often in minutes.
While new models still come fast, the changes are now more evolutionary than revolutionary. That shift means tuning a model today is less likely to be totally invalidated tomorrow.
Open-weight ecosystems like Mistral, Llama, Falcon, Yi, and Gemma give organizations plenty of alternatives to own, inspect, and persist their fine-tuned variants without vendor lock-in.
Finally, companies may have reached the ceiling of what can be achieved with prompting alone. Some want models that know their vocabulary, their tone, their taxonomy, and their compliance rules.
For all those reasons, fine-tuning is slowly stepping back into the spotlight, not as a trendy feature this time, but as a strategic lever for control, differentiation, and embedded intelligence.
Thinking Machines Lab’s Tinker, whose early research partners focus on theorem proving, chemistry reasoning, multi-agent reinforcement learning, and AI safety, embodies this shift. The team has already shared several recommendations in their blog post LoRA Without Regret, which explores how to fine-tune more effectively.
They recommend applying LoRA to all linear modules, not just the attention layers as in the original paper. They also emphasize the importance of the LoRA rank, a hyperparameter often overlooked, and advise higher learning rates (roughly 10x those used for full fine-tuning), smaller batch sizes (the opposite of common practice), and defining reward functions explicitly with mathematical or logical verification. All these recommendations are clearly explained and reproducible with Hugging Face’s TRL.
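A rough translation of those recommendations into TRL and PEFT might look like the sketch below; the model, dataset, and exact values are illustrative, and LoRA Without Regret remains the reference for the actual numbers.

```python
# Sketch of the "LoRA Without Regret" recommendations with TRL + PEFT.
# Model, dataset, and hyperparameter values are illustrative only.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset from the TRL docs

peft_config = LoraConfig(
    r=32,                         # rank matters more than it usually gets credit for
    lora_alpha=64,
    target_modules="all-linear",  # all linear modules, not just attention projections
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="lora-out",
    learning_rate=1e-4,              # roughly 10x a typical full fine-tuning learning rate
    per_device_train_batch_size=4,   # deliberately small batches
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example small open-weight model
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
```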
This is great for tuning, but maybe Tinker’s major contribution to fine-tuning is elsewhere.
The Best of Both Worlds
As expected, the modern fine-tuning pipeline looks nothing like what we used five years ago. It is modular, serverless, and orchestrated. A single deployment can run a base model alongside dozens of LoRA adapters, each representing a specific tone, function, or domain. During inference, the system routes queries to the right combination of adapters instead of relying on a static model file.
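In PEFT terms, that kind of deployment can be sketched roughly like this; the adapter names, paths, and the toy routing rule are hypothetical.

```python
# Minimal sketch of one base model serving several LoRA adapters with PEFT.
# Adapter paths and the routing rule are hypothetical placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load one adapter per domain on top of the same frozen base weights.
model = PeftModel.from_pretrained(base, "adapters/legal", adapter_name="legal")
model.load_adapter("adapters/support", adapter_name="support")
model.load_adapter("adapters/marketing", adapter_name="marketing")

def route(query: str) -> str:
    """Toy router: in production this would be a classifier or a rules engine."""
    if "contract" in query.lower():
        return "legal"
    if "refund" in query.lower():
        return "support"
    return "marketing"

model.set_adapter(route("Can I get a refund on my last order?"))
# ...then run generation as usual with the active adapter.
```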
Such modularity introduces its own challenges. All-in-one platforms like Together.ai handle most of the heavy lifting, but they often lack the granularity many teams need for fine-grained configuration and observability, and costs at scale can escalate quickly.
On paper, Tinker seems to offer the best of both worlds: the comfort of a modern, fully managed fine-tuning stack combined with fine-grained control for researchers. It provides direct API access to low-level training primitives, allowing users to orchestrate training workflows and custom algorithms at the deepest level while taking care of the grunt work. Hopefully this will inspire other platforms, since Tinker is currently reserved for research purposes.
Anyway, imperfect as these solutions may be, infrastructure challenges are becoming problems of the past. Yet one major difficulty remains: evaluation.
Models are notoriously difficult to evaluate. Human evaluation is inconsistent, slow, and, above all, expensive, while benchmarks age quickly and lose relevance as data contamination creeps in. Even newer approaches such as G-Eval (LLM-as-a-judge) or Chatbot Arena (crowdsourced pairwise votes) introduce their own problems, often amplifying bias and producing unstable scores.
Following the announcement of Tinker, Benjamin Anderson suggested Thinking Machines might have part of the solution. As he wrote in “Anatomy of a Modern Fine-Tuning API” on his blog, “It gives the user the power to do online reinforcement learning: take a completion from the current model weights, score that completion, and update the model based on whether that completion was good or bad. Supervised fine-tuning teaches the model to mimic pre-written responses, while online RL improves it by scoring its own responses.” With such architectures, the future of fine-tuning may not look like fine-tuning at all: it starts to resemble continuous learning.
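Stripped to its essentials, the loop Anderson describes might be sketched like this; generate, log_prob, and reward_fn are hypothetical placeholders, and in practice a library such as TRL would own the update step.

```python
# Schematic online-RL step: sample from the current weights, score the
# completion, and nudge the weights accordingly. Everything here
# (generate, log_prob, reward_fn) is a hypothetical placeholder.
def online_rl_step(model, optimizer, prompts, reward_fn):
    for prompt in prompts:
        completion = model.generate(prompt)                   # sample from the current policy
        reward = reward_fn(prompt, completion)                # score the model's own output
        loss = -reward * model.log_prob(prompt, completion)   # naive policy-gradient surrogate
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```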
A Very Different Tune
“Theoretically, fine-tuning has always made sense. But the speed at which closed-source labs were scaling their model intelligence made it a practically bad bet. Now, with compute, data, and better frameworks, the scales are tipping back toward specialization,” said Robert Hommes from Moyai.ai, a company specializing in adaptive fine-tuning.
And the shift to self-hosting could come even closer to you than you might think. As Constant Razel from Exxa put it, “Personal AI computers are no longer a distant idea. Technology is improving and becoming more accessible. Security and cost will likely drive early adoption, while fine-tuning will enable specialized, high-performance agents to run on them.”
From a brute-force chase for marginal accuracy to a framework for ownership, alignment, and continuous improvement rooted in proximity and control, fine-tuning may have found new territory in which to thrive. It is no longer just a technical step but perhaps a strategic layer in how intelligence will be built and owned.
🙌 Thanks to Robert Hommes, Benjamin Trom, Etienne Balit, Constant Razel, Cédric Deltheil, and Willy Braun for the insights, suggestions, and proofreading.