
Evaluating LLMs for my personal use case


Most models are excellent, so cost and latency dominate.

It’s great that AI can win maths Olympiads, but that’s not what I’m doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own evaluation.

I gathered 130 real prompts from my bash history (I use the command-line tool llm).
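One way to script that gathering step is below. This is a sketch, not the author's actual method: it assumes prompts were passed to llm as a single quoted argument, optionally preceded by a "-m model" flag, and the prompts.txt filename is illustrative.

```shell
# Sketch: collect candidate prompts from bash history by grabbing llm
# invocations. Assumes each prompt was one quoted argument, e.g.
#   llm 'With curl how do I follow redirects'
# optionally preceded by a "-m model" flag.
grep -E '^llm ' ~/.bash_history |
  sed -E 's/^llm (-m [^ ]+ )?//' |   # drop the command and optional model flag
  tr -d "\"'" |                      # strip surrounding quotes (crude)
  sort -u | head -n 130 > prompts.txt
wc -l < prompts.txt
```

Deduplicating with sort -u matters here: repeated invocations of the same question would otherwise skew the category counts.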

I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories. Both chose broadly similar ones (with examples):

Programming - “Write a bash script to ..”

Sysadmin - “With curl how do I ..”

Technical explanations - “Explain underlay networks in a data center”

General knowledge and creative tasks - “Recipe for blackened seasoning”

Then I had GPT-OSS-120B and GLM 4.5 each pick three queries per category from the 130 prompts. I used that to help me choose three entries per category; they are listed at the end.

I use OpenRouter every day, and I used it for these evals. It’s the only place I know of that has all the models, great prices, low latency, and a very sane API. I drive it with my own fast, simple Rust CLI called ort.
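OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so a single eval request is just an authenticated POST. A minimal curl sketch follows; the model slug is illustrative, so check OpenRouter's model list for the exact one you want.

```shell
# One eval request against OpenRouter's OpenAI-compatible API.
# The model slug is illustrative; see openrouter.ai/models for exact slugs.
payload='{
  "model": "qwen/qwen3-235b-a22b-thinking-2507",
  "messages": [{"role": "user", "content": "Write a bash script to tail the newest log file"}]
}'
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$payload"
```

Wrapping the curl in time (or capturing curl's %{time_total}) gives per-model latency numbers, which is exactly the dimension this eval cares about.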
