Most models are excellent, so cost and latency dominate.
It’s great that AI can win maths Olympiads, but that’s not what I’m doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own evaluation.
I gathered 130 real prompts from my bash history (I use the command-line tool llm).
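If you want to do something similar, pulling those invocations out of history is a one-off script. A rough Rust sketch; it assumes history lives at ~/.bash_history and that prompts were passed inline as `llm '...'`, so adjust for your shell:

```rust
// Rough sketch: pull past `llm` invocations out of bash history.
// Assumes history is at ~/.bash_history and prompts were passed inline,
// e.g. `llm 'how do I ...'`.
use std::collections::BTreeSet;
use std::fs;

fn main() -> std::io::Result<()> {
    let home = std::env::var("HOME").expect("HOME not set");
    let history = fs::read_to_string(format!("{home}/.bash_history"))?;

    // Keep unique history lines that start with an `llm` invocation.
    let prompts: BTreeSet<&str> = history
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with("llm "))
        .collect();

    for p in &prompts {
        println!("{p}");
    }
    eprintln!("{} unique llm invocations", prompts.len());
    Ok(())
}
```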
I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories. The two came up with broadly similar groupings (with examples):
Programming - “Write a bash script to ..”
Sysadmin - “With curl how do I ..”
Technical explanations - “Explain underlay networks in a data center”
General knowledge and creative tasks - “Recipe for blackened seasoning”
Then I had GPT-OSS-120B and GLM 4.5 pick three queries for each category from the 130 prompts. I used their picks to help me settle on three entries per category; they are listed at the end.
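The grouping step itself is nothing fancy: concatenate the queries into one prompt and ask the model to propose categories. A hypothetical sketch of that prompt construction (not the exact wording I used):

```rust
// Hypothetical sketch of building the grouping prompt; the wording is
// illustrative, not the exact prompt used for the eval.
fn grouping_prompt(queries: &[String]) -> String {
    let mut prompt = String::from(
        "Group the following prompts into a handful of broad categories. \
         Name each category and list which prompts belong to it.\n\n",
    );
    for (i, q) in queries.iter().enumerate() {
        prompt.push_str(&format!("{}. {}\n", i + 1, q));
    }
    prompt
}

fn main() {
    // Two example queries in the spirit of the ones above.
    let queries = vec![
        "Write a bash script to rotate log files".to_string(),
        "With curl how do I follow redirects".to_string(),
    ];
    println!("{}", grouping_prompt(&queries));
}
```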
I use Open Router every day, and I used it for these evals. It’s the only place I know that has all the models, great prices, low latency, and a very sane API. I use my own fast and simple Rust CLI called ort.
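ort itself isn’t shown here, but under the hood a tool like it is just POSTing to OpenRouter’s OpenAI-compatible chat completions endpoint. A minimal Rust sketch, assuming reqwest (blocking + json features) and serde_json; the model slug is only an example, not necessarily one I ran:

```rust
// Minimal sketch of a chat completion request to OpenRouter. This is not
// ort's actual code; the model slug below is just an example.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENROUTER_API_KEY")?;
    let body = json!({
        "model": "qwen/qwen3-235b-a22b-thinking-2507", // example slug; check the OpenRouter catalog
        "messages": [
            { "role": "user", "content": "Write a bash script to rotate log files" }
        ]
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("https://openrouter.ai/api/v1/chat/completions")
        .bearer_auth(api_key)
        .json(&body)
        .send()?
        .json()?;

    // Print just the assistant's reply text.
    println!("{}", resp["choices"][0]["message"]["content"].as_str().unwrap_or(""));
    Ok(())
}
```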