GLM-5.2 Hits 1M Tokens for Coding Agents, Trails Claude 4.8 by 1%

GLM-5.2 Opens 1M-Token Context with Architecture Optimizations

ZhipuAI released GLM-5.2, a 1M-token context model designed for long-horizon coding tasks. The model introduces IndexShare, an architecture change that shares one sparse-attention indexer across each group of four transformer layers, reducing per-token FLOPs by 2.9× at 1M context (company-reported). It also improves the multi-token prediction (MTP) layer used for speculative decoding, raising acceptance length by up to 20%.
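To make the indexer-sharing idea concrete, here is a minimal sketch. It assumes the indexer runs once on the first layer of each four-layer group and that the other three layers reuse its top-k token selection. The function names, the scoring stub, and the "first layer owns the indexer" rule are illustrative assumptions, not ZhipuAI's published implementation.

```python
from typing import Callable, Dict, List, Tuple

GROUP_SIZE = 4  # from the article: one indexer shared per four layers


def indexer_owner(layer: int, group_size: int = GROUP_SIZE) -> int:
    """Return the layer whose indexer output this layer reuses.

    Assumption: the first layer of each group owns the indexer.
    """
    return (layer // group_size) * group_size


def select_topk(scores: List[float], k: int) -> List[int]:
    """Pick the k highest-scoring token positions, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    score_fn: Callable[[int], List[float]],  # hypothetical stand-in for the indexer
    k: int,
) -> Tuple[List[List[int]], int]:
    """Walk the layers, calling the indexer only once per group."""
    cache: Dict[int, List[int]] = {}
    indexer_calls = 0
    selections: List[List[int]] = []
    for layer in range(num_layers):
        owner = indexer_owner(layer)
        if owner not in cache:
            cache[owner] = select_topk(score_fn(owner), k)
            indexer_calls += 1
        selections.append(cache[owner])
    return selections, indexer_calls
```

With eight layers, this sketch calls the indexer twice instead of eight times. That ratio applies only to indexer work; the reported 2.9× figure covers total per-token FLOPs and cannot be derived from this toy example.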

The model carries an MIT open-source license with no regional restrictions.

Benchmark Positioning

On three long-horizon coding benchmarks, GLM-5.2 is the highest-ranked open-source entry (company-reported). On FrontierSWE (open-ended technical projects spanning systems optimization and applied ML research), it trails Claude Opus 4.8 by 1%, edges out GPT-5.5 by 1%, and beats Opus 4.7 by 11%. On PostTrainBench (post-training on H100 GPUs), it ranks second only to Opus 4.8, outperforming both Opus 4.7 and GPT-5.5. On SWE-Marathon (ultra-long-horizon tasks like compiler building and kernel optimization), it ranks second to Opus 4.8, trailing by 13%.

On standard short-context coding benchmarks, GLM-5.2 improves over GLM-5.1: 81.0 vs. 63.5 on Terminal-Bench 2.1 (a 17.5-point gain) and 62.1 vs. 58.4 on SWE-bench Pro (a 3.7-point gain), both company-reported. On Terminal-Bench 2.1, it lands 4 points behind Opus 4.8 (85.0) and ahead of Gemini 3.1 Pro.

The model introduces effort-level control, allowing users to balance capability against latency and cost. At comparable token budgets, GLM-5.2 delivers stronger coding performance than GLM-5.1 and lands between Opus 4.7 and Opus 4.8.

Training and Infrastructure

GLM-5.2 uses an internal framework called slime for agentic reinforcement learning, handling multi-domain tasks, tool use, sub-task decomposition, and multi-turn environment feedback. The team merged ten expert models into the final model in approximately two days using parallel offline preference distillation (company-reported). Inference optimization focuses on KV-cache capacity, long-context kernel overhead, and CPU-side scheduling to handle longer prompts under GPU resource constraints.

The team implemented an anti-hack module for RL training and evaluation to prevent reward hacking in coding tasks (e.g., agents downloading solutions via curl or reading protected evaluation artifacts). The module uses a rule-based filter followed by LLM-judge verification to separate real task-solving from shortcuts.
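The two-stage shape described above (cheap rules first, then a judge) can be sketched as follows. This is a sketch under stated assumptions: the regex patterns, the `/protected_eval/` path, and the `judge` callable are all hypothetical, and the choice to send only rule-flagged trajectories to the judge is an assumption rather than something the release confirms.

```python
import re
from typing import Callable, List

# Hypothetical patterns. The article names curl downloads and reads of
# protected evaluation artifacts as examples of reward hacking.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\b(curl|wget)\b.*https?://"),
    re.compile(r"\b(cat|less|head|tail|cp)\b.*/protected_eval/"),
]

# A judge takes (all commands, flagged commands) and returns True for a hack.
# In practice this would wrap an LLM call; here it is an injected stub.
Judge = Callable[[List[str], List[str]], bool]


def rule_filter(commands: List[str]) -> List[str]:
    """Stage 1: return the commands that match any suspicious pattern."""
    return [c for c in commands if any(p.search(c) for p in SUSPICIOUS_PATTERNS)]


def classify_trajectory(commands: List[str], judge: Judge) -> str:
    """Stage 2: only rule-flagged trajectories are escalated to the judge."""
    flagged = rule_filter(commands)
    if not flagged:
        return "clean"
    return "hack" if judge(commands, flagged) else "clean"
```

Keeping the judge behind the rule filter means the expensive check runs only on trajectories that already look suspicious. It also lets the judge clear false positives, such as an agent that uses curl to fetch documentation.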

Open-Source Long-Context Baseline Matters for Agent Builders

A 1M-token open-source model that maintains quality under real coding-agent trajectories changes the cost-and-control calculus for teams building agentic systems. Teams that cannot send long traces to Claude or GPT-4o (due to privacy, cost, or latency constraints) now have a deployable alternative with published long-horizon benchmarks.

The gap to Claude Opus 4.8 is real (1% to 13% depending on benchmark), but the absolute performance and the ability to run inference on owned hardware matter more than closing percentage points. The risk: benchmarks favor in-distribution coding tasks; production agent behavior under novel environments or tool chains may not transfer.

Audit Your Agent Trace Length and Latency Budget

If your agent trajectories exceed 200K tokens, run a head-to-head test on GLM-5.2 vs. your current production model (Claude, GPT-4, Llama) using a subset of real tasks or a staging environment. Measure both final-task accuracy and per-token latency under your expected concurrency. If the 1M context allows you to fit full traces without chunking or pruning, the inference-engine optimizations may yield throughput gains that offset any small capability loss. If your traces stay under 128K, the upgrade likely does not justify re-evaluating vendor contracts.
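A quick way to start the audit is to bucket logged trace lengths against the two thresholds mentioned above. The sketch below assumes you already have one total token count per trajectory; the bucket names are illustrative, and the middle bucket (128K to 200K) is a judgment zone the guidance above does not address.

```python
from collections import Counter
from typing import Dict, Iterable

SHORT_LIMIT = 128_000  # below this, switching is likely not worth it
LONG_LIMIT = 200_000   # above this, a head-to-head test is suggested

BUCKETS = ("under_128k", "between", "over_200k")


def bucket(tokens: int) -> str:
    """Assign one trajectory's token count to a threshold bucket."""
    if tokens < SHORT_LIMIT:
        return "under_128k"
    if tokens > LONG_LIMIT:
        return "over_200k"
    return "between"


def audit(trace_token_counts: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of trajectories falling in each bucket."""
    counts = Counter(bucket(t) for t in trace_token_counts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts.get(name, 0) / total for name in BUCKETS}
```

If most of your traffic lands in `over_200k`, the head-to-head test is worth scheduling; if nearly all of it is in `under_128k`, the audit can stop here.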

Test effort-level control on expensive or high-stakes tasks. The ability to dial compute up or down per request can reduce average cost if most tasks run at baseline and only a subset require max effort.
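The cost argument is a weighted average, which a few lines make explicit. The per-request costs and the fraction of max-effort requests below are placeholder inputs you would replace with your own measurements; none of them come from the release.

```python
def average_cost(baseline_cost: float, max_cost: float, max_fraction: float) -> float:
    """Blended per-request cost when a fraction of requests run at max effort.

    baseline_cost: measured cost of one request at baseline effort (your data)
    max_cost:      measured cost of one request at max effort (your data)
    max_fraction:  share of requests routed to max effort, between 0 and 1
    """
    if not 0.0 <= max_fraction <= 1.0:
        raise ValueError("max_fraction must be between 0 and 1")
    return (1.0 - max_fraction) * baseline_cost + max_fraction * max_cost


def savings_vs_always_max(baseline_cost: float, max_cost: float, max_fraction: float) -> float:
    """Fractional saving compared with running every request at max effort."""
    blended = average_cost(baseline_cost, max_cost, max_fraction)
    return 1.0 - blended / max_cost
```

For example, with hypothetical costs of 1.0 at baseline and 4.0 at max effort, routing a quarter of requests to max effort gives a blended cost of 1.75 per request.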
