
Good results fine-tuning a local LLM like Qwen 3:0.6B to categorize questions

Why This Matters

This project demonstrates that a small, fine-tuned local language model like Qwen 3:0.6B can effectively categorize household questions, enhancing the efficiency of knowledge retrieval systems. Such advancements are significant for developing privacy-conscious, resource-efficient AI solutions tailored for personal or small-scale applications.

Key Takeaways

As a fun personal project, I have been working on a chatbot that answers general questions about my household, covering anything from maintenance to doctor’s appointments.

The general idea is that the chatbot gets its household knowledge through RAG by querying a vector database, but for better results I have made the vector searches metadata-aware.

Basically, I run each question through a pre-processing step that maps it to one of a set of known metadata categories (e.g. pool, car, hvac, cooking). The main goal is to narrow the search space for vector ranking to only those indexed entries that match the category of the question. As an example, the question “When did we replace our pool pump?” will be mapped to a category called “pool” before querying the index database.
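To make the metadata-aware search concrete, here is a minimal sketch of filtering by category before ranking by vector similarity. The toy index, the three-dimensional vectors and the function names are all made up for illustration; a real setup would use an embedding model and the filtering features of whatever vector database is in use.

```python
import math

# Toy index: each entry has a category tag and a made-up embedding vector.
INDEX = [
    {"text": "note about the pool pump", "category": "pool", "vec": [0.9, 0.1, 0.0]},
    {"text": "note about the pool filter", "category": "pool", "vec": [0.6, 0.4, 0.0]},
    {"text": "note about the car battery", "category": "car", "vec": [0.8, 0.2, 0.1]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, category, index, top_k=2):
    """Rank only entries in the given category; search everything if category is None."""
    candidates = [e for e in index if category is None or e["category"] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:top_k]


# Pretend this vector is the embedding of "When did we replace our pool pump?"
query_vec = [0.85, 0.15, 0.05]
results = search(query_vec, "pool", INDEX)
for entry in results:
    print(entry["category"], "-", entry["text"])
```

Without the filter, the car entry would compete with the pool entries on similarity alone; with it, the ranking only ever sees pool entries.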

The hypothesis I want to test in this experiment is whether a very small local LLM can be fine-tuned to perform reliable question categorization when trained on a dataset of household-related questions.

LLMs

In this project I am using two different local LLMs: Qwen 3:4B and Qwen 3:0.6B. The 4B parameter version is used for general question answering, while the super tiny 0.6B version is used to categorize questions. The whole premise of this experiment is to see if a tiny LLM with only 600M parameters can be fine-tuned into a reliable classifier of household questions.
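A classifier built on a generative model returns free text, so its output usually needs to be checked against the known categories before it is used as a search filter. The sketch below shows one way to do that; the category set and the parse_category helper are assumptions for illustration, not part of the original project.

```python
# Hypothetical set of known categories (the article lists these as examples).
KNOWN_CATEGORIES = {"pool", "car", "hvac", "cooking"}


def parse_category(raw_output, known=KNOWN_CATEGORIES):
    """Return a known category, or None if the model output is not recognized."""
    # Trim whitespace, then surrounding quotes or a trailing period, then lowercase.
    cleaned = raw_output.strip().strip('".').lower()
    return cleaned if cleaned in known else None


print(parse_category(" Pool.\n"))  # a recognized category
print(parse_category("garage"))    # not in the known set
```

Returning None for unrecognized output lets the caller fall back to an unfiltered vector search instead of filtering on a category that does not exist in the index.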

Fine-tuning

For fine-tuning I am using a popular open-source framework called Unsloth, which seems well suited for tuning local models like Qwen and Llama.

My initial dataset consists of about 850 entries, which I split 70/15/15 into training, eval and test data respectively. The training and eval data are used during training, while the test dataset is withheld and used to run a test after training. See below for sample data:

[
  { "question": "Who cleans our gutters at the house?", "category": "gutters" },
  { "question": "Who serviced the hot water heater for the home?", "category": "water heater" },
  { "question": "Who fixed the sprinkler system in the yard?", "category": "irrigation" },
  { "question": "Which store do we usually buy pinnekjott from?", "category": "cooking" },
  { "question": "What dimensions are the air filters for the home AC?", "category": "hvac" },
  { "question": "What year did we replace the downstairs AC unit?", "category": "hvac" }
]
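As a rough illustration of the 70/15/15 split described above, the sketch below shuffles a list of entries in this question/category shape and slices it into three parts. The split_dataset helper, the fixed seed and the placeholder entries are assumptions; the article does not say how the split was actually performed.

```python
import random


def split_dataset(entries, seed=42):
    """Shuffle a copy of the entries and split it 70/15/15 into train, eval and test."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    n_train = n * 70 // 100
    n_eval = n * 15 // 100
    train = shuffled[:n_train]
    eval_set = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]  # the remainder goes to the test set
    return train, eval_set, test


# Placeholder entries standing in for the real dataset.
entries = [{"question": f"placeholder question {i}", "category": "hvac"} for i in range(850)]
train, eval_set, test = split_dataset(entries)
print(len(train), len(eval_set), len(test))
```

Integer arithmetic avoids rounding surprises, and giving the remainder to the test set guarantees that every entry lands in exactly one of the three parts.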
