Running GLM-5.2 on local hardware

GLM-5.2 is Z.ai’s new open model, delivering SOTA performance across long-horizon coding, reasoning, and agentic tasks. With 744B parameters, 40B active parameters, and a 1M context window, it can now be run locally using GGUFs. GLM-5.2 is the strongest open model to date, performing on par with Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro across Artificial Analysis and many other benchmarks.

top-1 accuracy while being 86% smaller. Dynamic 2-bit reaches ~82% accuracy while being 84% smaller. In other words, the model is not 86% worse despite being 86% smaller; it is only ~24% less accurate than the full 1.5TB model. Thanks Z.ai for giving Unsloth day-zero access.


⚙️ Usage Guide

The 2-bit dynamic quant UD-IQ2_M uses 239GB of disk space. It fits directly on a Mac with 256GB of unified memory, and also works well on a single 24GB GPU plus 256GB of RAM with MoE offloading. The 1-bit quant fits in 223GB of RAM, and the 8-bit quant requires 810GB of RAM.
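As a quick sanity check on these sizes, the sketch below computes how much smaller the 239GB quant is than the full model. It assumes the "1.5TB" figure for the full model means roughly 1500GB in decimal units; `percent_smaller` is a hypothetical helper name used only for illustration.

```python
def percent_smaller(quant_gb: float, full_gb: float) -> float:
    """Return how much smaller the quant is than the full model, in percent."""
    if full_gb <= 0:
        raise ValueError("full_gb must be positive")
    return (1.0 - quant_gb / full_gb) * 100.0


FULL_MODEL_GB = 1500.0  # assumption: "1.5TB" read as 1500 GB
UD_IQ2_M_GB = 239.0     # disk size quoted for the 2-bit dynamic quant

reduction = percent_smaller(UD_IQ2_M_GB, FULL_MODEL_GB)
print(f"UD-IQ2_M is about {reduction:.0f}% smaller than the full model")
```

Under that assumption the result is about 84%, which is consistent with the reduction quoted for the dynamic 2-bit quant.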

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

Quant    Total memory
1-bit    223 GB
2-bit    245 GB
3-bit    290-360 GB
4-bit    372-475 GB
5-bit    570 GB
8-bit    810 GB
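The table can be turned into a small lookup that suggests the largest quant a given machine might hold. This is only a sketch: `largest_fitting_quant` is a hypothetical helper, the lower bound of each range is assumed to be the smallest variant at that bit width, and no extra headroom is reserved for the KV cache or long contexts.

```python
from typing import Optional

# (label, minimum total memory in GB), copied from the table above.
# For ranges such as "290-360 GB" the lower bound is used (an assumption).
MEMORY_TABLE_GB = [
    ("1-bit", 223),
    ("2-bit", 245),
    ("3-bit", 290),
    ("4-bit", 372),
    ("5-bit", 570),
    ("8-bit", 810),
]


def largest_fitting_quant(ram_gb: float, vram_gb: float = 0.0) -> Optional[str]:
    """Return the highest-bit quant whose minimum requirement fits, or None."""
    total_gb = ram_gb + vram_gb  # for unified memory, pass it all as ram_gb
    best = None
    for label, required_gb in MEMORY_TABLE_GB:  # ordered smallest to largest
        if required_gb <= total_gb:
            best = label
    return best


print(largest_fitting_quant(ram_gb=256, vram_gb=24))  # 280 GB total
print(largest_fitting_quant(ram_gb=128))              # below the 1-bit requirement
```

With 256GB of RAM and a 24GB GPU the lookup suggests the 2-bit quant, and with 128GB it returns `None` because even the 1-bit quant does not fit.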

Recommended Settings

GLM-5.2 has three modes: non-thinking, plus two thinking levels, High and Max. Use Max Thinking for complicated tasks. You can easily toggle between non-thinking, High Thinking, and Max Thinking with a UI.

Use these settings for most use cases:
