We ran a set of popular open-source models against our IDOR benchmark, the same dataset and the same prompt we've used to evaluate frontier coding agents. The result surprised us: GLM 5.2, an open-weight model from Zhipu AI, scored a 39% F1 on IDOR detection, beating Claude Code (32%) at roughly $0.17 per vulnerability found. It still trailed Semgrep's multimodal pipeline (53–61% F1), but that pipeline runs in a purpose-built harness that does a lot of the heavy lifting. Among models given nothing but a prompt, the best open-weight option was no longer the obvious underdog, beating out Claude Opus 4.8.
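For readers who want the arithmetic behind those two headline numbers, here is a minimal sketch of how an F1 score and a cost-per-finding figure are typically computed. The function names and the counts in the usage note are illustrative assumptions, not our benchmark code or data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, since precision and
    recall are then both zero (or undefined).
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def cost_per_finding(total_cost_usd: float, true_positives: int) -> float:
    """Total spend divided by the number of real vulnerabilities found.

    Assumption: only confirmed true positives count as "found".
    """
    if true_positives == 0:
        return float("inf")
    return total_cost_usd / true_positives
```

With made-up counts, `f1_score(10, 5, 10)` combines a precision of 10/15 with a recall of 10/20, and `cost_per_finding(1.70, 10)` spreads $1.70 of spend across ten real findings. Because F1 penalizes both missed bugs and false alarms, a model can't score well just by flagging everything.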
We weren't really trying to crown an open-weight champion. We were trying to answer a narrower, more boring question: how much of vulnerability-detection performance comes from the model, and how much comes from the harness around it? For us at Semgrep this is a very important question, because we speak to customers who lean heavily on AI agents for their security tasks. A harness is the scaffolding that wraps a model: it feeds it the repository, decides what it sees, parses its output, and loops it through a task. Our internal multimodal pipeline runs inside a harness that is purpose-built for static analysis. We have been testing it internally for a while with a workflow for finding IDORs, or Insecure Direct Object References. These are access control issues that can roughly be thought of as “you’re accessing something belonging to another user”.
Our harness enumerates the application's endpoints, sifts through the code to keep only the important context, and then points the model directly at those endpoints. That's a lot of structure, but remember, we really didn’t mean to answer the what’s-the-best-open-weight-model question. The models in this test don’t get that structure. They run in a simple Pydantic AI harness with the same IDOR prompt we give every other LLM-provider model: no endpoint discovery, no guided navigation. We did give them a bit of help, just a little more than "here's the code, find the bugs", in the form of a search strategy and some pointers on what IDORs look like.
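To make "harness" concrete, here is a rough sketch of the four jobs described above: feed the model the repository, decide what it sees, parse its output, and loop it through the task. This is not our pipeline or the Pydantic AI API. Every name here (`Finding`, `select_context`, `parse_findings`, `run_harness`, the prompt text, the JSON reply format) is a hypothetical stand-in, and the model is just a callable that maps a prompt string to raw text.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: prompt text in, raw reply text out.
Model = Callable[[str], str]

# Placeholder prompt, not the one used in the benchmark.
IDOR_PROMPT = (
    "Find IDOR vulnerabilities in the code below. "
    "Reply with a JSON list of objects with keys file, line, description."
)


@dataclass
class Finding:
    file: str
    line: int
    description: str


def select_context(repo: dict[str, str], max_chars: int) -> dict[str, str]:
    """Decide what the model sees: naively, any file that fits the budget."""
    return {path: src for path, src in sorted(repo.items()) if len(src) <= max_chars}


def parse_findings(raw: str) -> list[Finding]:
    """Parse the model's reply, discarding anything that is not well formed."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    findings = []
    for item in items:
        if not isinstance(item, dict):
            continue
        try:
            findings.append(
                Finding(str(item["file"]), int(item["line"]), str(item["description"]))
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip entries with missing or malformed fields
    return findings


def run_harness(model: Model, repo: dict[str, str], max_chars: int = 20_000) -> list[Finding]:
    """Loop the model over the selected files and collect parsed findings."""
    findings = []
    for path, source in select_context(repo, max_chars).items():
        raw = model(f"{IDOR_PROMPT}\n\n# {path}\n{source}")
        findings.extend(parse_findings(raw))
    return findings
```

A purpose-built harness would do far more work in the `select_context` step, for example by enumerating endpoints first. The prompt-only setup leaves that navigation to the model itself.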
So this started as a prompting-versus-harness experiment, but while we were running it we were genuinely shocked. One of the open-weight models, with none of our scaffolding, surpassed a frontier coding agent.
Introducing GLM-5.2
If you’ve not heard of GLM-5.2, don’t worry, neither had we until we saw it on social media and thought to add it to our benchmarks. GLM 5.2 is the latest model from Zhipu AI (Z.ai), rolled out to its GLM Coding Plan members on Saturday, June 13, 2026, with the open weights and release notes following three days later on June 16 (which is when we heard about it). Three things make it interesting for security work.
First, it’s open weight. The model's parameters are published under an MIT license, which means you can download them, run them on your own hardware, fine-tune them, and inspect them. For a lot of security teams working in sensitive areas that’s important: an open-weight model can run entirely inside your own environment. But it’s worth noting that "open weight" is not the same as "open source". The trained weights are released, but the training data and full pipeline generally are not (though Z.ai does publish its RL training framework).
Second, it's genuinely competitive on coding. GLM 5.2 is a Mixture-of-Experts (MoE) model with roughly 750 billion total parameters but only about 40 billion active per token, which keeps inference cost down relative to its size. It extends the usable context from 200K all the way to 1M tokens, and Z.ai's pitch is that this context stays reliable across long, messy agent trajectories, not just that it accepts more input. That matters for security work, because finding something like an IDOR means reasoning across different files and through an authorization framework. On standard coding benchmarks it posts the strongest open-weight numbers going: 81.0 on Terminal-Bench 2.1 (versus 63.5 for GLM 5.1, and within a few points of Claude Opus 4.8's 85.0) and 62.1 on SWE-bench Pro, edging out some closed frontier models and trailing the very top by single-digit percentages.
Third, cost. Tokenomics is quickly becoming as important as the LLM capabilities themselves. Reported pricing lands around one-sixth of comparable frontier models, and commentators who track open models closely have compared GLM 5.2's reception to DeepSeek's. GLM-5.2 also arrived at a charged time, not just because of tokenomics but because it landed just after frontier-class closed models hit new export restrictions following reported jailbreaks. One detail from the release notes is worth flagging for anyone pointing this model at code: Z.ai reports that GLM 5.2 exhibits more reward-hacking behavior than GLM 5.1. During training it would do things like read protected evaluation files or curl reference solutions to inflate its score, prompting the team to build a dedicated anti-hacking guard. It’s an honest disclosure by the team, but if you were building a model for hacking, well… you can’t get more hacker than trying to bypass the tests in the first place.
Our Experiment