OpenAI just released its first open-weight large language models since GPT-2, called gpt-oss-120b and gpt-oss-20b. You can talk to them here. Are they good models? Well, that depends on what you’re looking for. They’re great at some benchmarks, of course (OpenAI would never have released them otherwise) but weirdly bad at others, like SimpleQA, which tests short factual recall.
Some people really like them. Others on Twitter really don’t. From what I can tell, they’re technically competent but lack a lot of out-of-domain knowledge: for instance, they have broad general knowledge about science, but don’t know much about popular culture. We’ll know in six months how useful they are in practice, but my prediction is that they’ll end up in the category of “performs much better on benchmarks than on real-world tasks”.
Phi models and training on synthetic data
In 2024, Sébastien Bubeck led the development of Microsoft’s open-source Phi series of models. The big idea behind those models was to train almost exclusively on synthetic data: instead of raw text pulled from books or the internet, text generated by other language models or drawn from hand-curated textbooks. Synthetic data is scarcer than organic data, since instead of downloading terabytes of it for free you have to spend money to generate each token. The trade-off is that you get complete control over your training data. What happens when you train a model entirely on high-quality synthetic and curated data?
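To make the idea concrete, here’s a minimal sketch of the generation side of such a pipeline. It uses OpenAI’s actual Python SDK, but the teacher model, prompts, and topic list are illustrative assumptions on my part, not anything Phi actually used:

```python
# A minimal sketch of generating synthetic training text with a "teacher"
# model. The model name, prompts, and topics below are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["photosynthesis", "binary search", "plate tectonics"]

def generate_document(topic: str) -> str:
    """Ask the teacher model for a textbook-style passage on one topic."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable teacher model would do
        messages=[
            {"role": "system",
             "content": "You write clear, factual textbook sections."},
            {"role": "user",
             "content": f"Explain {topic} to an advanced student."},
        ],
    )
    return response.choices[0].message.content

# Every token here costs API money, unlike scraped web text, but you
# control exactly what ends up in the training corpus.
corpus = [generate_document(topic) for topic in TOPICS]
```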
As it turns out, such a model does very well on benchmarks but disappoints in practice. Search for the reception of each Phi model and you’ll find the same pattern: very impressive benchmark numbers, lots of initial enthusiasm, and then real-world performance far weaker than those numbers would suggest.
I think the impressive benchmark results come from the fact that these models are very easy to steer toward specific tasks, because you generate much of the training data yourself. If you’re training on synthetic data, you’d be foolish not to generate some of it to match the kinds of problems people benchmark on. But since you’re “teaching to the test”, you should expect to perform worse in the real world than models trained on broad data, which end up good at the benchmarks almost by accident.
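As a hypothetical illustration of that targeting, here’s what generating benchmark-shaped training data might look like. This is my own sketch of the general technique, assuming a GSM8K-style math benchmark; nothing here is confirmed about how Phi or gpt-oss were actually trained:

```python
# A hypothetical sketch of "teaching to the test": synthesizing training
# examples shaped like a public benchmark (GSM8K-style word problems).
import random

def make_word_problem() -> dict:
    """Produce one synthetic question/answer pair in benchmark format."""
    apples = random.randint(2, 20)
    eaten = random.randint(1, apples - 1)
    question = (
        f"Sam has {apples} apples and eats {eaten} of them. "
        "How many apples does Sam have left?"
    )
    # Because we generated the problem ourselves, the answer label is
    # guaranteed correct: free supervision, perfectly matched in format
    # to the benchmark the model will later be judged on.
    return {"question": question, "answer": str(apples - eaten)}

train_set = [make_word_problem() for _ in range(100_000)]
```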
Why am I talking about Phi models? At the end of 2024, Sébastien Bubeck left Microsoft to join OpenAI. We don’t know who was involved in making the new gpt-oss models, and the model card doesn’t say much about the pretraining stage. But I’d bet that Bubeck was part of the effort, and that these models were trained on a heavily filtered or synthetic dataset.
Synthetic data is safer
Why would OpenAI train Phi-style models, knowing that they’ll perform better on benchmarks than in real-world applications? Probably for the same reason Microsoft kept training Phi-style models: safety. Releasing an open-weight model is terrifying for a large organization. Once it’s out there, your name is attached to it forever, and thousands of researchers will immediately start trying to fine-tune away its safety guardrails.
It’s not discussed publicly very often, but the main use case for fine-tuning small language models is erotic role-play, and there’s serious demand for it. Any small online community for people who run local models is at least 50% perverts.