
GLM-5.2 – How to Run Locally

Why This Matters

GLM-5.2 represents a significant advancement in open-source AI models, offering state-of-the-art performance comparable to proprietary models while being more resource-efficient. Its ability to run locally with various quantization options empowers developers and organizations to deploy powerful AI without relying on cloud services, enhancing privacy and reducing costs.

Key Takeaways

GLM-5.2 is Z.ai’s new open model, delivering SOTA performance across long-horizon coding, reasoning, and agentic tasks. With 744B parameters, 40B active parameters, and a 1M context window, it can now be run locally using GGUFs. GLM-5.2 is the strongest open model to date, performing on par with Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro across Artificial Analysis and many other benchmarks.

… top-1 accuracy while being 86% smaller. Dynamic 2-bit reaches ~82% accuracy while being 84% smaller. In other words, a model that is 86% smaller is not 86% worse: it is only ~24% less accurate than the full 1.5TB model. Thanks to Z.ai for giving Unsloth day-zero access.
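As a rough sanity check on the size figures, the sketch below computes how much smaller a quant is than the full model. It assumes "1.5TB" means 1500 GB and uses the 239GB disk size this article quotes for the 2-bit UD-IQ2_M quant. The function name `percent_smaller` is illustrative and not part of any tool.

```python
def percent_smaller(quant_gb: float, full_gb: float) -> float:
    """Return how much smaller the quant is than the full model, in percent."""
    return (1.0 - quant_gb / full_gb) * 100.0


FULL_MODEL_GB = 1500.0  # assumption: "1.5TB" read as 1500 GB
UD_IQ2_M_GB = 239.0     # disk size quoted in this article for the 2-bit dynamic quant

reduction = percent_smaller(UD_IQ2_M_GB, FULL_MODEL_GB)
print(f"2-bit dynamic quant is about {reduction:.0f}% smaller")
```

Under these assumptions the result rounds to 84%, which agrees with the "84% smaller" figure quoted for the dynamic 2-bit quant.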


⚙️ Usage Guide

The 2-bit dynamic quant UD-IQ2_M uses 239GB of disk space. It fits directly on a Mac with 256GB of unified memory, and also works well on a single 24GB GPU paired with 256GB of RAM using MoE offloading. The 1-bit quant fits in 223GB of RAM, and the 8-bit quant requires 810GB of RAM.

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

Quant    Total memory
1-bit    223 GB
2-bit    245 GB
3-bit    290-360 GB
4-bit    372-475 GB
5-bit    570 GB
8-bit    810 GB
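To see which quant a given machine can hold, the figures in the table can be turned into a small lookup. This is a minimal sketch, not an official sizing tool: the function name `largest_quant_that_fits` and the `conservative` flag are hypothetical, and treating the upper end of a range as the safe requirement is an assumption, since the article does not say what the ranges represent.

```python
from typing import Optional

# Total memory (RAM + VRAM, or unified memory) in GB, taken from the table.
# Ranges are stored as (low, high); single values repeat the same number.
REQUIREMENTS_GB = {
    "1-bit": (223, 223),
    "2-bit": (245, 245),
    "3-bit": (290, 360),
    "4-bit": (372, 475),
    "5-bit": (570, 570),
    "8-bit": (810, 810),
}


def largest_quant_that_fits(
    ram_gb: float, vram_gb: float = 0.0, conservative: bool = True
) -> Optional[str]:
    """Return the highest-bit quant whose requirement fits in RAM + VRAM.

    With conservative=True the upper end of each range is used (an assumption);
    otherwise the lower end is used. Returns None if nothing fits.
    """
    total_gb = ram_gb + vram_gb
    best = None
    # Entries are listed in increasing order of memory, so the last one
    # that fits is the largest.
    for name, (low_gb, high_gb) in REQUIREMENTS_GB.items():
        needed_gb = high_gb if conservative else low_gb
        if needed_gb <= total_gb:
            best = name
    return best


# A 256GB unified-memory Mac, and a 24GB GPU paired with 256GB of RAM.
print(largest_quant_that_fits(256))
print(largest_quant_that_fits(256, vram_gb=24))
```

Both example machines resolve to the 2-bit quant, matching the guidance in the usage notes. Passing `conservative=False` takes the lower end of each range instead, which may be optimistic.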

Recommended Settings

GLM-5.2 has three modes: non-thinking, plus two thinking modes, High and Max. Use Max Thinking for complicated tasks. A UI lets you easily toggle between High Thinking, Max Thinking, and non-thinking.

Use these settings for most use cases:

... continue reading