GLM-5.1 — first open-weight model to lead SWE-Bench Pro

Z.ai's 744B MoE shipped MIT, hit 58.4 on SWE-Bench Pro, narrowly beats GPT-5.4 and Claude Opus 4.6 on that benchmark. Crucial caveat: at 466 GB Q4 it's hosted-only realistic. The 'open-weight' framing matters less than the SWE-Bench Pro leadership.

Verdict: First open-weight to top SWE-Bench Pro; MIT but hosted-only realistic

The take

Z.ai (formerly Zhipu AI) shipped GLM-5.1 on April 7 — 744B total / 40B active MoE, MIT license, 200K context (131K max output). The headline benchmark: SWE-Bench Pro 58.4, narrowly leading GPT-5.4 (57.x) and Claude Opus 4.6 (56.x) on that specific long-horizon coding eval. First open-weight model to top a frontier-level coding benchmark cleanly.

Why MIT-license framing matters less than expected: at 744B params, GLM-5.1 at Q4 is ~466 GB on disk. That's not a single-card model, not a single-server model, and not a Mac Studio model. Realistic deployment is hosted (Z.ai's own API or via OpenRouter) or workstation-class multi-GPU. Ollama ships a `glm-5.1:cloud` tag for hosted inference — the honest path for consumer hardware.

Unsloth dynamic 2-bit (220 GB) and 1-bit (200 GB) GGUFs do exist for workstation-class local runs, but at heavy quantization the SWE-Bench Pro leadership likely degrades. The headline number was measured at full precision via Z.ai's own infra; community testing hasn't yet confirmed how much of that holds at 1-bit quant.

Where it sits: agents.top alongside Kimi K2.6 and Qwen 3.6-35B-A3B. Use GLM-5.1 via API or `:cloud` tag for the SWE-Bench Pro leadership when it matters. For local-first agentic coding, Qwen3-Coder-30B-A3B (single-card 24 GB) or Qwen 3.6-35B-A3B (mixed coding + chat) remain the realistic picks.

Where this fits

Models: GLM-5.1 · Kimi K2.6 · Qwen 3.6-35B-A3B · Qwen3-Coder-30B-A3B

Hardware: NVIDIA DGX Spark · Dual RTX 5090

Sources

Next step

Try this in the planner→