When I was in college, my data structures professor told a story. It went something like this:
"When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I'd found a bug in the C compiler. Turns out the problem was me."
The takeaway was clear: if you have a bug, it's your fault.
This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you're suspicious about an operation or stage that's taking a long time, it may be implemented in a way that's efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it's not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even Pytorch isn't ground truth.
Over the past couple months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models.
Making it work
The goal is to post-train Kimi-K2-Thinking. My success criteria is both qualitative and quantitative: loss should go down and the model should change behavior in line with the dataset we train on.
It’s an open source model, so surely there should be some training code online. But it turns out there isn’t really any. LLaMA-Factory + KTransformers is supposed to support it, but I encountered a bunch of bugs. Also, it’s designed for CPU offloading + GPU training, which adds unnecessary complexity and is inefficient.
What about HuggingFace? It has basically everything. Kimi-k2-thinking is available along with a config and modeling class which seems to support and implement the model. The HuggingFace model info doesn’t say whether training is supported, but HuggingFace’s Transformers library supports models in the same architecture family, such as DeepSeek-V3. The fundamentals seem to be there; we might need some small changes, but how hard can it be?
First, we need a dataset for which we’ll be able to tell if the model has trained. Let's create one that will make our model talk like Yoda. We can get a bunch of questions from TriviaQA, and generate responses by prompting an LLM to answer the question while pretending it’s Yoda. Running the script, I get a few thousand prompts and responses that look something like this:
... continue reading