Most models are excellent, so cost and latency dominate.
It’s great that AI can win maths Olympiads, but that’s not what I’m doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own evaluation.
I gathered 130 real prompts from my bash history (I use the command-line tool llm).
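If you want to do something similar, pulling those invocations out of history is a one-off script. A rough Rust sketch; it assumes history lives at ~/.bash_history and that prompts were passed inline as `llm '...'`, so adjust for your shell:

```rust
// Rough sketch: pull past `llm` invocations out of bash history.
// Assumes history is at ~/.bash_history and prompts were passed inline,
// e.g. `llm 'how do I ...'`.
use std::collections::BTreeSet;
use std::fs;

fn main() -> std::io::Result<()> {
    let home = std::env::var("HOME").expect("HOME not set");
    let history = fs::read_to_string(format!("{home}/.bash_history"))?;

    // Keep unique history lines that start with an `llm` invocation.
    let prompts: BTreeSet<&str> = history
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with("llm "))
        .collect();

    for p in &prompts {
        println!("{p}");
    }
    eprintln!("{} unique llm invocations", prompts.len());
    Ok(())
}
```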
I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories. The two came up with broadly similar groupings (with examples):
Programming - “Write a bash script to ..”
Sysadmin - “With curl how do I ..”
Technical explanations - “Explain underlay networks in a data center”
General knowledge and creative tasks - “Recipe for blackened seasoning”
Then I had GPT-OSS-120B and GLM 4.5 pick three queries for each category from the 130 prompts. I used their picks to help me settle on three entries per category; they are listed at the end.
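The grouping step itself is nothing fancy: concatenate the queries into one prompt and ask the model to propose categories. A hypothetical sketch of that prompt construction (not the exact wording I used):

```rust
// Hypothetical sketch of building the grouping prompt; the wording is
// illustrative, not the exact prompt used for the eval.
fn grouping_prompt(queries: &[String]) -> String {
    let mut prompt = String::from(
        "Group the following prompts into a handful of broad categories. \
         Name each category and list which prompts belong to it.\n\n",
    );
    for (i, q) in queries.iter().enumerate() {
        prompt.push_str(&format!("{}. {}\n", i + 1, q));
    }
    prompt
}

fn main() {
    // Two example queries in the spirit of the ones above.
    let queries = vec![
        "Write a bash script to rotate log files".to_string(),
        "With curl how do I follow redirects".to_string(),
    ];
    println!("{}", grouping_prompt(&queries));
}
```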
I use Open Router every day, and I used it for these evals. It’s the only place I know that has all the models, great prices, low latency, and a very sane API. I use my own fast and simple Rust CLI called ort.
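ort itself isn’t shown here, but under the hood a tool like it is just POSTing to OpenRouter’s OpenAI-compatible chat completions endpoint. A minimal Rust sketch, assuming reqwest (blocking + json features) and serde_json; the model slug is only an example, not necessarily one I ran:

```rust
// Minimal sketch of a chat completion request to OpenRouter. This is not
// ort's actual code; the model slug below is just an example.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENROUTER_API_KEY")?;
    let body = json!({
        "model": "qwen/qwen3-235b-a22b-thinking-2507", // example slug; check the OpenRouter catalog
        "messages": [
            { "role": "user", "content": "Write a bash script to rotate log files" }
        ]
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("https://openrouter.ai/api/v1/chat/completions")
        .bearer_auth(api_key)
        .json(&body)
        .send()?
        .json()?;

    // Print just the assistant's reply text.
    println!("{}", resp["choices"][0]["message"]["content"].as_str().unwrap_or(""));
    Ok(())
}
```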