GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

📝 Paper Summary

Foundation Models, Agentic AI, Reasoning

GLM-4.5 is an open-source Mixture-of-Experts model that unifies agentic, reasoning, and coding capabilities through a hybrid reasoning approach and multi-stage post-training with verifiable rewards.

Core Problem

Achieving unified mastery of complex reasoning, coding, and agentic interaction in a single open-source model remains elusive, as most open models trail proprietary leaders (like o1/o3 or Claude 3.5 Sonnet) in these specific ARC domains.

Why it matters:

Proprietary models dominate high-stakes tasks like mathematical reasoning and software engineering, limiting open research access.
Existing open models often specialize in one area (e.g., just coding or just math) rather than acting as general problem solvers.
Effective agentic behaviors require tight integration of reasoning with tool use, which is difficult to train without specialized feedback loops.

Concrete Example: In web browsing tasks involving elusive, interwoven facts, standard models often fail to navigate or filter information effectively. GLM-4.5 uses a specialized data synthesis pipeline for multi-step web search to train the model to persist through difficult retrieval tasks.

Key Novelty

Unified ARC (Agentic, Reasoning, Coding) via Hybrid Reasoning

Combines 'thinking' mode (deliberative reasoning for complex tasks) and 'direct response' mode within a single model architecture.
Utilizes a massive post-training pipeline that iteratively distills specialized experts (Reasoning, Agent, General) back into a unified generalist model.
Implements difficulty-based curriculum learning in RL, dynamically adjusting problem complexity and sampling temperature to prevent training plateaus.
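The difficulty-based curriculum can be pictured with a minimal sketch, assuming difficulty is estimated from an empirical pass rate per problem. The function name, problem ids, and thresholds below are hypothetical and not taken from the paper. The idea is to train only on problems the current policy sometimes solves and sometimes fails.

```python
def select_training_problems(pass_rates, low=0.1, high=0.9):
    """Keep problems whose estimated pass rate lies strictly between low and high.

    pass_rates: dict mapping a problem id to the fraction of sampled
    responses that were judged correct (hypothetical bookkeeping).
    """
    return sorted(pid for pid, rate in pass_rates.items() if low < rate < high)


# Hypothetical pass rates measured with the current policy.
pass_rates = {"p1": 0.0, "p2": 0.25, "p3": 0.5, "p4": 1.0}
selected = select_training_problems(pass_rates)
print(selected)
```

Problems that are always solved or never solved are dropped here because, with a binary reward, every response in such a group receives the same reward and a group-relative signal carries no information. Re-estimating pass rates as the policy improves would shift the selection toward harder problems.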

Architecture

The paper presents the training pipeline (Pre-training -> Mid-training) explicitly, while the model structure is conveyed only implicitly through parameter tables.

Evaluation Highlights

Achieves 70.1% on TAU-Bench (Agentic), matching Claude Sonnet 4 performance.
Scored 91.0% on AIME 24 (Math Reasoning), surpassing GPT-4.1 and Qwen3-235B-Thinking.
Attains 64.2% on SWE-bench Verified (Coding), outperforming GPT-4.1 and Gemini-2.5-pro.

Breakthrough Assessment

9/10

Significant leap for open weights. Matches or beats top proprietary models (Claude Sonnet 4, GPT-4.1) on key hard benchmarks (AIME, SWE-bench) with fewer parameters than competitors like Llama 405B or DeepSeek V3.

⚙️ Technical Details

Problem Definition

Setting: General-purpose language modeling with specific optimization for long-horizon agentic tasks, mathematical reasoning, and repository-level coding.

Inputs: Natural language prompts, code repositories, or tool outputs (up to 128K context)

Outputs: Text, code, or structured tool calls (up to 64K output length)

Pipeline Flow

User Prompt → MoE Transformer Layers (with Thinking/Direct Mode) → MTP Head → Output

System Modules

Input Embedding & RoPE

Tokenizes input and applies partial Rotary Positional Embeddings

Model or implementation: Transformer Embedding
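Partial rotary embeddings can be illustrated with a small NumPy sketch. It assumes that "partial" means only a fraction of each head's dimensions is rotated while the rest pass through unchanged. The fraction (0.5), the frequency base, and the interleaved pairing layout are assumptions for illustration, not values confirmed by this summary.

```python
import numpy as np


def partial_rope(x, positions, rotary_fraction=0.5, base=10000.0):
    """Apply rotary embeddings to the first part of each head vector only.

    x: array of shape (seq_len, head_dim)
    positions: array of shape (seq_len,) with token positions
    """
    head_dim = x.shape[1]
    rot_dim = int(head_dim * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so keep this even
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]

    # One rotation frequency per pair of rotated dimensions.
    inv_freq = base ** (-np.arange(0, rot_dim, 2) / rot_dim)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (even, odd) pair by its position-dependent angle.
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos

    # The remaining dimensions carry no positional rotation.
    return np.concatenate([rotated, x_pass], axis=-1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = partial_rope(x, np.arange(4))
```

Because each pair is rotated rather than scaled, vector norms are preserved, and the token at position 0 is left unchanged.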

MoE Layers

Process tokens using routed experts for efficient computation

Model or implementation: 89 layers (GLM-4.5) / 45 layers (Air), 160/128 experts total
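Routing tokens to experts can be sketched as follows. This is a toy illustration, not the model's implementation: experts are plain linear maps, the gate is a softmax over the selected experts, and `top_k=2` is a hypothetical setting. The summary does not state the gating function or the number of active experts per token.

```python
import numpy as np


def moe_layer(x, router_w, expert_ws, top_k=2):
    """Toy top-k routed MoE layer.

    x: (tokens, d); router_w: (d, n_experts); expert_ws: (n_experts, d, d)
    """
    logits = x @ router_w
    # Indices of the top_k highest-scoring experts for every token.
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected experts only (an assumed gating choice).
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for e in range(expert_ws.shape[0]):
        # Tokens that selected expert e, and the slot in which they did so.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        expert_out = x[token_ids] @ expert_ws[e]
        out[token_ids] += gates[token_ids, slot][:, None] * expert_out
    return out, top_idx


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
router_w = rng.standard_normal((4, 8))
expert_ws = rng.standard_normal((8, 4, 4))
out, chosen = moe_layer(x, router_w, expert_ws)
```

Each token touches only `top_k` of the 8 toy experts, which is why total parameter count can grow much faster than per-token compute.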

MTP Layer

Predicts multiple future tokens to support speculative decoding

Model or implementation: 1 MoE layer specialized for multi-token prediction

Novel Architectural Elements

Deep & Narrow MoE: Reduced width (5120 hidden dim) but increased depth (89 layers) compared to DeepSeek-V3/Kimi K2 to enhance reasoning.
High Head Count: 96 attention heads (2.5x standard for this width), found to improve reasoning benchmarks despite neutral loss impact.
MTP Layer Integration: Incorporates a specific MoE layer for Multi-Token Prediction to aid speculative decoding.
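How a multi-token prediction head can support speculative decoding is sketched below with a greedy verification step. The draft and target "models" are stand-in callables, and the greedy acceptance rule is an assumption for illustration. Sampling-based variants use a probabilistic acceptance test instead.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=2):
    """One greedy speculative-decoding step; returns the newly accepted tokens.

    draft_next / target_next: callables mapping a token list to the next token.
    """
    # 1. The cheap draft predictor proposes several tokens ahead.
    proposal, ctx = [], list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal (one batched pass in practice).
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        target_tok = target_next(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts upward; the second draft goes wrong after 3.
target = lambda ctx: ctx[-1] + 1
good_draft = lambda ctx: ctx[-1] + 1
weak_draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0

all_accepted = speculative_step([1, 2], good_draft, target)
partly_accepted = speculative_step([1, 2], weak_draft, target)
```

In both cases the accepted tokens are exactly what the target would have produced on its own. The draft only changes how many tokens are confirmed per target pass.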

Modeling

Base Model: GLM-4.5 (355B total, 32B active) and GLM-4.5-Air (106B total, 12B active)

Training Method: Multi-stage RL (GRPO) + Iterative Self-Distillation

Objective Functions:

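Purpose: Optimize policy using group-relative rewards without KL term.

Formally: $\mathcal{L}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x}\left[\frac{1}{K}\sum_{i=1}^{K}\left(r(x, y_i) - \bar{r}(x)\right)\right]$, where $\bar{r}(x) = \frac{1}{K}\sum_{i=1}^{K} r(x, y_i)$ is the mean reward over the $K$ responses sampled for prompt $x$.

Purpose: Enforce correct function call formats.

Formally: Reward = 1 if FormatCorrect AND Match(ground_truth), else 0.

Purpose: Multi-token prediction loss for speculative decoding.

Formally: MTP loss weight λ = 0.3 (initial) -> 0.1 (late).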
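The first two objectives above can be sketched in a few lines. This follows the mean-centering in the RL formula as summarized here (some GRPO variants additionally divide by the group standard deviation). The function names and the dictionary representation of a function call are assumptions for illustration.

```python
import numpy as np


def group_relative_advantages(rewards):
    """Center each response's reward on the mean of its group.

    rewards: the K rewards r(x, y_i) for K responses sampled for one prompt x.
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()


def function_call_reward(format_correct, call, ground_truth):
    """Binary reward: 1 only if the call is well-formed and matches the reference."""
    return 1.0 if format_correct and call == ground_truth else 0.0


# Four sampled responses to one prompt, scored with a binary reward.
advantages = group_relative_advantages([1, 0, 0, 1])
print(advantages)

# A group where every response earns the same reward yields all-zero advantages.
flat = group_relative_advantages([1, 1, 1])
```

Responses above the group mean get a positive weight and those below get a negative one, so no separate value network is needed to form a baseline.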

Adaptation: Full model training (MoE parameters)

Training Data:

Pre-training: 23T tokens (Web, Code, Math/Science)
Mid-training: Repo-level code, synthetic reasoning (500B), Long context (100B)
Post-training: SFT data (millions of samples), RL data (math, code, agent trajectories)

Key Hyperparameters:

max_sequence_length: 131,072 (mid-training/SFT)
learning_rate: 2.5e-4 (peak) -> 2.5e-5 (end)
batch_size: 64M tokens
weight_decay: 0.1
optimizer: Muon
Newton_Schulz_iterations: 5
momentum: 0.95
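To show how the Muon settings listed above fit together, here is a simplified NumPy sketch of a single update. It reuses the listed values (5 Newton-Schulz iterations, momentum 0.95, weight decay 0.1, peak learning rate 2.5e-4). The quintic coefficients come from the publicly available Muon reference code, not from this summary, and the plain (non-Nesterov) momentum and decoupled weight decay are simplifying assumptions.

```python
import numpy as np


def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately replace g by a semi-orthogonal matrix with the same shape."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + eps)  # Frobenius scaling keeps singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so the Gram matrix is small
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x


def muon_step(weight, grad, momentum_buf, lr=2.5e-4, momentum=0.95, weight_decay=0.1):
    """One simplified Muon update for a 2D weight matrix."""
    momentum_buf = momentum * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf, steps=5)
    weight = weight * (1.0 - lr * weight_decay) - lr * update
    return weight, momentum_buf


rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 4))
ortho = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(ortho, compute_uv=False)
new_weight, buf = muon_step(np.zeros((16, 4)), grad, np.zeros((16, 4)))
```

After the iterations, the singular values of the update are pushed toward a narrow band around 1, so every direction in the weight matrix moves by a comparable amount regardless of the raw gradient scale.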

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-V3: GLM-4.5 uses a deeper, narrower architecture (89 layers vs 61) and more attention heads relative to its hidden dimension (96 heads at a 5120 hidden dimension, vs 128 heads in the wider DeepSeek-V3) for better reasoning.
vs. DeepSeek-R1: GLM-4.5 combines reasoning (thinking) and agentic capabilities in one model, whereas R1 is primarily reasoning-focused.
vs. Kimi K2: GLM-4.5 has ~1/3 the parameters (355B vs 1043B) but competitive/superior performance.
vs. Qwen-2.5 [not cited in paper]: GLM-4.5 integrates Multi-Token Prediction (MTP) directly into the architecture for speculative decoding, unlike standard Qwen.

Limitations

Computational cost of 'thinking' mode increases inference latency.
Requires complex multi-stage training pipeline (Expert construction -> Unified distillation).
Function calling templates require specific XML-like formatting to reduce escaping issues.
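The escaping issue behind the last limitation can be illustrated with a short sketch. The tag names (`<tool_call>`, `<arg_key>`, `<arg_value>`), the tool name `run_python`, and the parser are illustrative assumptions, not a confirmed description of the model's template. The sketch contrasts a JSON-encoded call, where code arguments must be escaped, with a tag-delimited call that carries the code verbatim.

```python
import json
import re

code = 'print("hello")\nprint(\'world\')'

# JSON-encoded arguments must escape quotes and newlines inside the code string.
json_call = json.dumps({"name": "run_python", "arguments": {"code": code}})

# Tag-delimited arguments carry the code without any escaping.
xml_call = (
    "<tool_call>run_python\n"
    f"<arg_key>code</arg_key>\n<arg_value>{code}</arg_value>\n"
    "</tool_call>"
)


def parse_xml_call(text):
    """Parse the hypothetical tag format into a name and an argument dict."""
    match = re.search(r"<tool_call>(.*?)\n(.*)</tool_call>", text, re.DOTALL)
    if match is None:
        return None  # malformed call; a format-based reward would score it 0
    name, body = match.group(1).strip(), match.group(2)
    pairs = re.findall(
        r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", body, re.DOTALL
    )
    return {"name": name, "arguments": dict(pairs)}


parsed = parse_xml_call(xml_call)
```

The model has to emit every backslash correctly in the JSON form, while the tagged form only requires matching open and close tags, at the cost of tying callers to this specific template.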

Reproducibility

Code: https://github.com/zai-org/GLM-4.5

Model weights are publicly available (https://huggingface.co/zai-org/GLM-4.5), and an evaluation toolkit (glm-simple-evals) is released. Training code and specific datasets are not released.

📊 Experiments & Results

Evaluation Setup

Evaluated on 12 benchmarks covering Agentic, Reasoning, and Coding (ARC) tasks.

Benchmarks:

TAU-Bench (Agentic: Retail/Airline)
AIME 24 (Mathematical Reasoning)
SWE-bench Verified (Software Engineering/Coding)
LiveCodeBench (2407-2501) (Coding)
BrowseComp (Web Browsing Agent)

Metrics:

Pass Rate
Accuracy
Percent Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Agentic benchmarks showing GLM-4.5 competitive with top proprietary models.
TAU-Bench	Score (%)	70.1	70.1	0.0
BrowseComp	Score (%)	18.8	26.4	+7.6
Reasoning benchmarks demonstrating strong math performance.
AIME 24	Accuracy (%)	82.4	91.0	+8.6
GPQA	Accuracy (%)	66.0	79.1	+13.1
Coding benchmarks showing efficiency on real-world tasks.
SWE-bench Verified	Score (%)	51.1	64.2	+13.1
Terminal-Bench	Score (%)	28.5	37.5	+9.0

Experiment Figures

Radar charts comparing GLM-4.5 against competitors (Claude, GPT-4, DeepSeek) on Agentic, Coding, and Reasoning axes.

Scatter plot of SWE-bench Verified Score vs. Model Parameters.

Training curves for Difficulty-based Curriculum Learning on AIME 24.

Main Takeaways

Ranked 3rd overall across 12 ARC benchmarks, behind only OpenAI o3 and DeepSeek-R1-0528 (in some metrics), but with significantly fewer parameters than DeepSeek-R1.
Demonstrates that a single model can handle both 'thinking' (slow reasoning) and 'direct' (fast chat) modes effectively via hybrid training.
RL strategies like token-weighted mean loss for code and difficulty-based curriculum for math are critical for post-training efficiency.
Single-stage RL at max context length (64K) outperforms multi-stage length extension, avoiding 'unlearning' of long-context capabilities.
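The token-weighted mean loss mentioned above can be contrasted with a per-sequence mean in a small sketch. It assumes the alternative being compared against is an average of per-sequence means. The loss values and function names are made up for illustration.

```python
import numpy as np


def sequence_mean_loss(token_losses):
    """Average of per-sequence means: every sample counts equally."""
    return float(np.mean([np.mean(seq) for seq in token_losses]))


def token_weighted_mean_loss(token_losses):
    """Average over all tokens in the batch: every token counts equally."""
    return float(np.concatenate(token_losses).mean())


# A short sample with high loss and a long sample with zero loss.
batch = [np.array([1.0, 1.0]), np.array([0.0] * 8)]
seq_mean = sequence_mean_loss(batch)        # (1.0 + 0.0) / 2
tok_mean = token_weighted_mean_loss(batch)  # 2.0 / 10 tokens
print(seq_mean, tok_mean)
```

Under the token-weighted form, long code trajectories contribute in proportion to their length instead of being compressed into a single per-sample average.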

📚 Prerequisite Knowledge

Prerequisites

Mixture-of-Experts (MoE) architecture
Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF)
Group Relative Policy Optimization (GRPO)
Rejection Sampling

Key Terms

MoE: Mixture-of-Experts, a model architecture in which only a subset of sub-networks (experts) is activated for each input token, increasing total capacity without a proportional increase in per-token inference compute.

MTP: Multi-Token Prediction—a training objective where the model predicts multiple future tokens simultaneously to improve reasoning and enable speculative decoding.

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt, removing the need for a separate value network.

SFT: Supervised Fine-Tuning—training a model on labeled examples to teach it specific behaviors or formats.

CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer.

ARC: Agentic, Reasoning, and Coding—the three core capabilities targeted by this model family.

Muon optimizer: An optimizer that orthogonalizes momentum-based updates of weight matrices with a few Newton-Schulz iterations, intended to accelerate convergence and handle large batch sizes efficiently.

RoPE: Rotary Positional Embeddings—a method for encoding position information in transformer models.

Self-distillation: A process where a stronger version of a model (e.g., trained via RL) generates data to train a new base version of itself.

Pareto Frontier: The set of optimal solutions where no objective can be improved without sacrificing another; here referring to the trade-off between model size and performance.