Today, we introduce two new members of the GLM family: GLM-4.5 and GLM-4.5-Air, our latest flagship models. GLM-4.5 has 355 billion total parameters and 32 billion active parameters, while GLM-4.5-Air has 106 billion total parameters and 12 billion active parameters. Both are designed to unify reasoning, coding, and agentic capabilities in a single model, meeting the increasingly complex demands of fast-growing agentic applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that offer two modes: a thinking mode for complex reasoning and tool use, and a non-thinking mode for instant responses. They are available on Z.ai and the Z.ai API, and open weights are available on HuggingFace and ModelScope.

Background

LLMs have always aimed at human-level cognitive capabilities across a wide range of domains, rather than being designed for specific tasks. A strong LLM must handle general problem solving, generalization, common-sense reasoning, and self-improvement. Over the past five years, OpenAI's GPT-3 showed that large models can acquire common-sense knowledge, and o1 used reinforcement learning to think before responding, significantly improving reasoning skills in coding, data analysis, and complex math. However, the resulting models are still not truly general: some are good at coding, some at math, and some at reasoning, but none achieves the best performance across all of these tasks. GLM-4.5 works toward the goal of unifying these different capabilities.

Overall Performance

We compare GLM-4.5 with various models from OpenAI, Anthropic, Google DeepMind, xAI, Alibaba, Moonshot, and DeepSeek on 12 benchmarks covering agentic (3), reasoning (7), and coding (2) tasks. Overall, GLM-4.5 ranks 3rd and GLM-4.5-Air ranks 6th.
Agentic Tasks

GLM-4.5 is a foundation model optimized for agentic tasks. It provides a 128K context length and native function-calling capability. We measure its agentic ability on τ-bench and BFCL-v3 (Berkeley Function Calling Leaderboard v3). On both benchmarks, GLM-4.5 matches the performance of Claude 4 Sonnet.
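To illustrate what native function calling looks like in practice (the interaction pattern that τ-bench and BFCL-style evaluations exercise), here is a minimal sketch using the OpenAI Python client against an OpenAI-compatible endpoint. The base URL, API key, model identifier, and the `get_weather` tool are placeholders for illustration, not official API details.

```python
# Minimal function-calling sketch against an OpenAI-compatible endpoint.
# The base_url, api_key, model name, and get_weather tool are illustrative
# placeholders, not official API details.
import json
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_ENDPOINT/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Beijing right now?"}]

# First turn: the model decides whether to call the tool and with what arguments.
response = client.chat.completions.create(
    model="glm-4.5",  # placeholder model identifier
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Execute the tool locally (stubbed here) and return the observation to the model.
tool_result = {"city": args["city"], "condition": "sunny", "temp_c": 24}
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(tool_result)})

final = client.chat.completions.create(model="glm-4.5", messages=messages, tools=tools)
print(final.choices[0].message.content)
```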
Web browsing is a popular agentic application that requires complex reasoning and multi-turn tool use. We evaluate GLM-4.5 on BrowseComp, a challenging web-browsing benchmark consisting of complicated questions that expect short answers. With access to a web-browsing tool, GLM-4.5 answers 26.4% of all questions correctly, clearly outperforming Claude-4-Opus (18.8%) and approaching o4-mini-high (28.3%). The figure below shows the test-time scaling accuracy of GLM-4.5 on BrowseComp.
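To make the multi-turn tool-use pattern concrete, here is a minimal sketch of the kind of browse-and-answer loop a BrowseComp-style evaluation exercises. The `llm_next_action`, `web_search`, and `open_page` helpers and the turn budget are hypothetical stand-ins that only illustrate the control flow, not the actual evaluation harness.

```python
# Illustrative sketch of a multi-turn web-browsing agent loop.
# llm_next_action, web_search, and open_page are hypothetical stand-ins for the
# model call and browsing tools; the real harness is not shown here.
from typing import Callable

def web_search(query: str) -> str:
    return "stubbed search results for: " + query   # placeholder tool

def open_page(url: str) -> str:
    return "stubbed page text for: " + url          # placeholder tool

TOOLS: dict[str, Callable[[str], str]] = {"web_search": web_search, "open_page": open_page}

def llm_next_action(question: str, history: list[tuple[dict, str]]) -> dict:
    """Stand-in for a thinking-mode model call with tools enabled.
    Returns {"tool": name, "arg": value} to act, or {"answer": text} to stop."""
    raise NotImplementedError

def browse_and_answer(question: str, max_turns: int = 20) -> str:
    history: list[tuple[dict, str]] = []
    for _ in range(max_turns):
        action = llm_next_action(question, history)
        if "answer" in action:                 # model produced its short final answer
            return action["answer"]
        observation = TOOLS[action["tool"]](action["arg"])
        history.append((action, observation))  # observation is fed back on the next turn
    return ""                                  # turn budget exhausted without an answer
```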
Detailed results for all comparison models on the three benchmarks used to evaluate agentic ability are listed in the following table.

| Benchmark | GLM-4.5 | GLM-4.5-Air | o3 | o4-mini-high | GPT-4.1 | Claude 4 Opus | Claude 4 Sonnet | Gemini 2.5 Pro | Qwen3 235B Thinking 2507 | DeepSeek-R1-0528 | DeepSeek V3 0324 | Kimi K2 | Grok 4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TAU-bench | 70.1 | 69.4 | 61.2 | 57.4 | 62.0 | 70.5 | 70.3 | 62.5 | 73.2 | 58.7 | 57.6 | 62.6 | 67.5 |
| BFCL v3 (Full) | 77.8 | 76.4 | 72.4 | 67.2 | 68.9 | 61.8 | 75.2 | 61.2 | 72.4 | 63.8 | 64.7 | 71.1 | 66.2 |
| BrowseComp | 26.4 | 21.3 | 49.7 | 28.3 | 4.1 | 18.8 | 14.7 | 7.6 | 4.6 | 3.2 | 1.5 | 7.9 | 32.6 |

Reasoning

In thinking mode, GLM-4.5 and GLM-4.5-Air can solve complex reasoning problems, including mathematics, science, and logic problems.

| Benchmark | GLM-4.5 | GLM-4.5-Air | o3 | Claude 4 Opus | Gemini 2.5 Pro | DeepSeek-R1-0528 | Qwen3-235B-Thinking 2507 | Grok 4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU Pro | 84.6 | 81.4 | 85.3 | 87.3 | 86.2 | 84.9 | 84.5 | 86.6 |
| AIME24 | 91.0 | 89.4 | 90.3 | 75.7 | 88.7 | 89.3 | 94.1 | 94.3 |
| MATH 500 | 98.2 | 98.1 | 99.2 | 98.2 | 96.7 | 98.3 | 98.0 | 99.0 |
| SciCode | 41.7 | 37.3 | 41.0 | 39.8 | 42.8 | 40.3 | 42.9 | 45.7 |
| GPQA | 79.1 | 75.0 | 82.7 | 79.6 | 84.4 | 81.3 | 81.1 | 87.7 |
| HLE | 14.4 | 10.6 | 20.0 | 11.7 | 21.1 | 14.9 | 15.8 | 23.9 |
| LiveCodeBench (2407-2501) | 72.9 | 70.7 | 78.4 | 63.6 | 80.1 | 77.0 | 78.2 | 81.9 |
| AA-Index (Estimated) | 67.7 | 64.8 | 70.0 | 64.4 | 70.5 | 68.3 | 69.4 | 73.2 |

For the AIME and GPQA benchmarks, we report the average accuracy over 32 and 8 samples, respectively (Avg@32, Avg@8), to mitigate result variance. An LLM was used for automated answer validation. For the HLE benchmark, only the text-based questions were evaluated, with correctness judged by GPT-4o.

Coding

GLM-4.5 excels at coding, both building coding projects from scratch and agentically solving coding tasks in existing projects. It can be seamlessly combined with existing coding toolkits such as Claude Code, Roo Code, and CodeGeeX. To evaluate coding capability, we compared different models on SWE-bench Verified and Terminal-Bench. The following table presents the results.

| Benchmark | GLM-4.5 | GLM-4.5-Air | o3 | GPT-4.1 | Claude 4 Opus | Claude 4 Sonnet | Gemini 2.5 Pro | DeepSeek-R1-0528 | Kimi K2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified¹ | 64.2 | 57.6 | 69.1 | 48.6 | 67.8 | 70.4 | 49.0 | 41.4 | 65.4 |
| Terminal-Bench² | 37.5 | 30.0 | 30.2 | 30.3 | 43.2 | 35.5 | 25.3 | 17.5 | 25.0 |

¹ For SWE-bench Verified, we use OpenHands v0.34.0 with runs limited to 100 iterations and history truncation to avoid exceeding the 128K context limit, configured with temperature=0.6 and top_p=1.0.
² For Terminal-Bench, we use the Terminus framework for evaluation, with standard function calling rather than direct prompting.

We conducted a Pareto frontier analysis of all comparison models (illustrated in the figure below). GLM-4.5 and GLM-4.5-Air demonstrate superior performance relative to models of comparable scale, sitting on the performance-scale trade-off frontier.
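To make the Pareto frontier analysis concrete, the sketch below shows how such a frontier can be computed from (total parameters, aggregate score) pairs: a model sits on the frontier if no other model is at least as small and strictly better. The model entries in the usage example are placeholders, not the actual data points behind the figure.

```python
# Minimal sketch of a performance-vs-scale Pareto frontier check.
# A model is on the frontier if no other model matches or beats it on both
# axes (smaller or equal size, higher or equal score) while being strictly
# better on at least one. Entries below are placeholders, not real data.
from dataclasses import dataclass

@dataclass
class ModelPoint:
    name: str
    total_params_b: float   # total parameters, in billions
    score: float            # aggregate benchmark score

def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    frontier = []
    for p in points:
        dominated = any(
            q.total_params_b <= p.total_params_b
            and q.score >= p.score
            and (q.total_params_b < p.total_params_b or q.score > p.score)
            for q in points if q is not p
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda m: m.total_params_b)

# Placeholder usage; sizes and scores are illustrative only.
models = [
    ModelPoint("model-a", 110, 64.5),
    ModelPoint("model-b", 360, 67.5),
    ModelPoint("model-c", 700, 66.0),
]
print([m.name for m in pareto_frontier(models)])  # ['model-a', 'model-b']
```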
GLM-4.5 demonstrates comprehensive full-stack development capabilities, enabling seamless creation of web applications that encompass frontend implementation, database management, and backend deployment. The frontend interfaces generated by GLM-4.5 exhibit enhanced functionality and aesthetic appeal, demonstrating strong alignment with human design preferences. Furthermore, GLM-4.5 exhibits superior performance in generating presentation materials, including slides and posters, with capabilities significantly augmented when integrated with agentic tools for information retrieval and contextual enhancement.
To assess GLM-4.5's agentic coding capabilities, we utilized Claude Code to evaluate performance against Claude-4-Sonnet, Kimi K2, and Qwen3-Coder across 52 coding tasks spanning frontend development, tool development, data analysis, testing, and algorithm implementation. All evaluations were performed in isolated testing environments through multi-round human interaction with standardized evaluation criteria to ensure consistency and reproducibility. The empirical results demonstrate that GLM-4.5 achieves a 53.9% win rate against Kimi K2 and exhibits dominant performance over Qwen3-Coder with an 80.8% success rate. While GLM-4.5 shows competitive performance, further optimization opportunities remain when compared to Claude-4-Sonnet.
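For reference, the win rates above follow the standard pairwise-comparison convention: each coding task yields a human verdict on which model's solution is preferred, and the win rate is the preferred fraction across tasks. The sketch below illustrates the calculation; counting a tie as half a win and the example judgments are assumptions for illustration, not the actual 52-task data.

```python
# Illustrative win-rate computation for pairwise coding-task comparisons.
# Each judgment is the human verdict for one model vs. one baseline on one
# task; counting a tie as half a win is an assumption made for this sketch.
def win_rate(judgments: list[str]) -> float:
    wins = judgments.count("win") + 0.5 * judgments.count("tie")
    return wins / len(judgments)

# Hypothetical example over a handful of tasks (not the actual evaluation data):
print(win_rate(["win", "win", "tie", "loss", "win"]))  # 0.7
```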
Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains. Unlike the earlier pre-training stages on large-scale general documents, these stages leverage medium-sized domain-specific datasets, including instruction data.
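The staged curriculum described above can be summarized as a simple schedule. In the sketch below, only the 15T and 7T token budgets come from the text; the stage labels and the unspecified mid-training budget are assumptions for illustration.

```python
# Illustrative summary of the staged training curriculum described above.
# The 15T and 7T token budgets come from the text; the stage labels and the
# unspecified mid-training budget are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    corpus: str
    token_budget: str

CURRICULUM = [
    TrainingStage("pre-training, stage 1", "general corpus", "15T tokens"),
    TrainingStage("pre-training, stage 2", "code & reasoning corpus", "7T tokens"),
    TrainingStage("mid-training", "medium-sized domain-specific data, incl. instruction data", "unspecified"),
]

for stage in CURRICULUM:
    print(f"{stage.name}: {stage.corpus} ({stage.token_budget})")
```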
RL for Large-Scale Models with slime

To facilitate the highly efficient reinforcement learning (RL) training required for large-scale models such as GLM-4.5, we have designed, developed, and open-sourced slime. This RL infrastructure is engineered for exceptional flexibility, efficiency, and scalability, and we actively encourage community use and contributions. slime's primary innovations are architected to overcome common RL bottlenecks, particularly in complex agentic tasks.

• Flexible Hybrid Training Architecture: slime's core strength is its versatile hybrid architecture. It supports both synchronous, co-located training, ideal for traditional applications like reasoning and general RL, and a disaggregated, asynchronous training mode. The asynchronous paradigm is critical for advanced agentic RL, where data generation can be a slow, external process. By decoupling training from data collection, it keeps the training GPUs fully saturated, maximizing hardware utilization.
• Decoupled Agent-Oriented Design: Agentic RL often suffers from slow environment rollouts with long-tail latency distributions, which severely throttle training throughput. To address this, slime implements a fully decoupled infrastructure that separates rollout engines from training engines. These components operate independently on distinct hardware, transforming the data-generation bottleneck into a parallelized, non-blocking process. This design is fundamental to accelerating long-horizon agent tasks.
• Accelerated Data Generation with Mixed Precision: To further boost throughput, slime features accelerated rollouts using mixed-precision inference. It strategically employs the highly efficient FP8 format for data generation while retaining the stability of BF16 for the model training loop. This technique dramatically increases data-generation speed without compromising training quality.

This cohesive design allows slime to seamlessly integrate multiple agent frameworks, support diverse tasks, and efficiently manage long-horizon rollouts through a unified, powerful interface.

Post-Training with Reinforcement Learning for Agentic Capabilities