Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
For anyone who has been taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning’s two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.
Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here:
And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript):
I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!
I tried Opus a second time passing thinking_level: max . It didn’t do much better (transcript):
I don’t think Qwen are cheating
A lot of people are convinced that the labs train for my stupid benchmark. I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests—here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:
Qwen3.6-35B-A3B
(transcript) Opus 4.7
... continue reading