Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

📝 Paper Summary

Multi-Agent Reinforcement Learning, LLM Post-training, Agent Orchestration

Dr. MAS stabilizes multi-agent reinforcement learning by normalizing advantages per agent rather than globally, preventing gradient spikes caused by diverse reward distributions among specialized agents.

Core Problem

Applying group-based RL (like GRPO) to multi-agent systems is unstable because a global normalization baseline fails when agents have heterogeneous reward distributions.

Why it matters:

Specialized agents (e.g., planners vs. executors) naturally operate in different reward ranges; forcing a global mean causes gradient explosions.
Current frameworks (veRL, ROLL) lack native support for efficient multi-agent orchestration and shared resource scheduling for co-training multiple LLMs.
Instability prevents effective post-training of complex multi-agent systems required for advanced reasoning and tool-use tasks.

Concrete Example: In a search task, a 'Search' agent might receive lower average rewards than an 'Answer' agent. Using a global mean, the Search agent's updates are consistently biased, inflating gradient norms and causing the model to collapse (e.g., Qwen2.5-7B stops calling search tools entirely).
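The bias in this example can be reproduced with a few numbers. The sketch below uses made-up step-level rewards (not data from the paper) and assumes normalization is a plain mean/std standardization. The agent labels and the helper `agent_stats` are hypothetical names chosen for illustration.

```python
import numpy as np

# Made-up step-level rewards for illustration (not data from the paper):
# the Search agent's steps see lower rewards than the Answer agent's steps.
rewards = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0, 0.9, 0.8])
agents = np.array(["search"] * 3 + ["answer"] * 5)

# Global normalization: one mean and one std shared by every agent's steps.
global_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)


def agent_stats(adv, agents, name):
    """Return (mean, second moment) of the advantages assigned to one agent."""
    a = adv[agents == name]
    return float(a.mean()), float((a ** 2).mean())


search_mean, search_m2 = agent_stats(global_adv, agents, "search")
answer_mean, answer_m2 = agent_stats(global_adv, agents, "answer")
print(f"search: mean={search_mean:.2f}, second moment={search_m2:.2f}")
print(f"answer: mean={answer_mean:.2f}, second moment={answer_m2:.2f}")
```

With these numbers every Search step receives a negative advantage, and the Search agent's second moment exceeds 1 while the Answer agent's falls below 1. This is the kind of systematic bias the example describes, shown on toy data only.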

Key Novelty

Agent-Wise Advantage Normalization (Dr. MAS)

Instead of normalizing rewards using a global group mean/variance, each agent normalizes its advantages using only its own reward statistics.
This theoretically bounds the second moment of the gradient for each agent, eliminating the variance inflation caused by reward distribution shifts between agents.
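A one-line identity shows why a shared baseline inflates the advantage magnitude. This is a sketch under assumed notation, not the paper's exact statement: $\mu_k, \sigma_k^2$ denote the mean and variance of the rewards seen by agent $k$, $\mu, \sigma^2$ the global statistics, and $\mathbb{E}_k$ an expectation over agent $k$'s rewards.

```latex
% Globally normalized advantage, evaluated on agent k's rewards R:
\mathbb{E}_k\!\left[\left(\frac{R-\mu}{\sigma}\right)^{2}\right]
  = \frac{\sigma_k^{2} + (\mu_k-\mu)^{2}}{\sigma^{2}}

% Agent-wise normalized advantage:
\mathbb{E}_k\!\left[\left(\frac{R-\mu_k}{\sigma_k}\right)^{2}\right] = 1
```

The first quantity grows with the squared gap between the agent's mean and the global mean, while the second stays at 1 no matter how other agents are rewarded. How this carries over to a bound on the gradient's second moment depends on the assumptions about the score function stated in the paper.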

Architecture

The Dr. MAS system framework including orchestration, resource pooling, and agent-wise optimization.

Evaluation Highlights

+5.6% avg@16 and +4.6% pass@16 improvement on math reasoning benchmarks over vanilla GRPO using Qwen3 models.
+15.2% avg@16 and +13.1% pass@16 improvement on multi-turn search tasks over vanilla GRPO using Qwen2.5 models.
Recovers Qwen2.5-7B in the non-shared search setting: avg@16 rises from 28.0% under vanilla GRPO, which collapsed due to gradient instability, to 43.8% with Dr. MAS.

Breakthrough Assessment

8/10

Identifies a fundamental theoretical flaw in applying GRPO to multi-agent settings and provides a simple, rigorous fix that yields significant stability and performance gains.

⚙️ Technical Details

Problem Definition

Setting: Cooperative multi-agent LLM system with K agents optimizing a joint trajectory reward R via Group Relative Policy Optimization (GRPO).

Inputs: Task instruction x sampled from distribution p(X).

Outputs: Joint interaction trajectory containing states, actions, and agent identifiers, resulting in a terminal reward R.

Pipeline Flow

Trajectory Collector (distributes rollout)
Multi-Agent Orchestra (governs roles/flow)
Resource Pool Manager (schedules LLM backends)
Trainer (optimizes policies)

System Modules

Multi-Agent Orchestra

Dynamically selects and invokes agent policies based on current state or prior outputs

Model or implementation: User-defined logic (e.g., loop or hierarchy)

Resource Pool Manager

Manages shared pool of LLM backends (sglang) and routes requests

Model or implementation: Various LLMs (Qwen2.5, Qwen3)

Trainer

Computes agent-wise normalized advantages and updates policies

Model or implementation: Optimizer (e.g., AdamW)
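To make the division of labor among these modules concrete, here is a minimal sketch of a user-defined orchestration loop for the search setting. It is not the DrMAS API: the `Step`/`Trajectory` types, the `run_episode` function, and the routing rule (the verifier decides whether to search or answer) are all hypothetical, and the agent policies are stub callables rather than LLM backends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    agent_id: str  # which agent acted
    state: str     # what it observed
    action: str    # what it produced


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # terminal reward for the joint trajectory


# A policy is modeled as a plain callable: state text -> action text.
Policy = Callable[[str], str]


def run_episode(task: str, policies: Dict[str, Policy],
                reward_fn: Callable[[str], float],
                max_turns: int = 4) -> Trajectory:
    traj, state = Trajectory(), task
    for _ in range(max_turns):
        # Hypothetical routing rule: the verifier picks the next agent.
        verdict = policies["verifier"](state)
        traj.steps.append(Step("verifier", state, verdict))
        next_agent = "answer" if verdict == "sufficient" else "search"
        action = policies[next_agent](state)
        traj.steps.append(Step(next_agent, state, action))
        if next_agent == "answer":
            traj.reward = reward_fn(action)
            return traj
        state = state + "\n" + action  # append retrieved evidence
    # Turn budget exhausted: force a final answer.
    action = policies["answer"](state)
    traj.steps.append(Step("answer", state, action))
    traj.reward = reward_fn(action)
    return traj


# Stub policies standing in for LLM-backed agents.
policies = {
    "verifier": lambda s: "sufficient" if "evidence" in s else "insufficient",
    "search": lambda s: "evidence: Paris is the capital of France",
    "answer": lambda s: "Paris",
}
traj = run_episode("What is the capital of France?", policies,
                   lambda a: float(a == "Paris"))
print([s.agent_id for s in traj.steps], traj.reward)
```

Recording the agent identifier on every step is what lets a trainer later group steps by agent when computing advantages.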

Novel Architectural Elements

Agent-wise advantage normalization module integrated into the GRPO loss calculation
Shared resource pooling that decouples logical agents from physical LLM backends, allowing flexible co-training of heterogeneous models (e.g., 7B and 3B mixed)
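The decoupling of logical agents from physical backends can be pictured as a simple mapping. The sketch below is an assumption about how such a mapping might look, not the framework's actual configuration format: `build_pool` and the backend ids `pool-0`, `pool-1`, `pool-2` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_pool(agent_to_backend: Dict[str, str]) -> Dict[str, List[str]]:
    """Group logical agents by the physical backend id that serves them."""
    pool: Dict[str, List[str]] = defaultdict(list)
    for agent, backend in agent_to_backend.items():
        pool[backend].append(agent)
    return dict(pool)


# Shared setting: all agents are served by (and update) one backend.
shared = {"verifier": "pool-0", "search": "pool-0", "answer": "pool-0"}

# Non-shared setting: each agent has its own backend and weights,
# which also allows different model sizes per agent.
non_shared = {"verifier": "pool-0", "search": "pool-1", "answer": "pool-2"}

print(build_pool(shared))
print(build_pool(non_shared))
```

Under this view, switching between shared and non-shared training, or mixing model sizes, only changes the mapping and leaves the agent logic untouched.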

Modeling

Base Model: Qwen2.5 (3B/7B) and Qwen3 (4B/8B)

Training Method: Dr. MAS (Agent-wise Group Relative Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward using importance sampling with agent-specific normalization.

Formally: E[ min( rho * A_norm, clip(rho, 1-eps, 1+eps) * A_norm ) ] where A_norm uses per-agent mean/std.
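The objective above can be sketched in a few lines of NumPy. This is an illustrative reading of the formula, not the authors' implementation. It assumes that the inputs come from a single rollout group, that each step carries its trajectory's reward and the id of the agent that acted, and that `clip_eps=0.2` is a reasonable default (the value is not taken from the paper). The function names are hypothetical.

```python
import numpy as np


def agentwise_advantages(rewards, agent_ids, eps=1e-8):
    """Standardize each step's reward with the mean/std of its own agent."""
    rewards = np.asarray(rewards, dtype=float)
    agent_ids = np.asarray(agent_ids)
    adv = np.empty_like(rewards)
    for k in np.unique(agent_ids):
        mask = agent_ids == k
        r_k = rewards[mask]
        adv[mask] = (r_k - r_k.mean()) / (r_k.std() + eps)
    return adv


def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Mean over steps of min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    rho = np.exp(np.asarray(logp_new, dtype=float)
                 - np.asarray(logp_old, dtype=float))
    adv = np.asarray(adv, dtype=float)
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(np.minimum(unclipped, clipped).mean())


# Toy usage: three Search steps and two Answer steps.
rewards = [0.0, 0.1, 0.2, 0.8, 1.0]
agent_ids = ["search", "search", "search", "answer", "answer"]
adv = agentwise_advantages(rewards, agent_ids)
logp = np.zeros(5)  # identical old and new log-probs, so rho = 1
print(adv, clipped_surrogate(logp, logp, adv))
```

In a real trainer the surrogate would be maximized with an autodiff framework. NumPy is used here only to keep the arithmetic visible.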

Training Data:

Math: DAPO-Math corpus (diverse problems with verifiable solutions)
Search: Mixture of NQ and HotpotQA

Key Hyperparameters:

group_size_math: 8
group_size_search: 5
max_turn_search: 4

Compute: Not reported in the paper

Comparison to Prior Work

vs. GRPO: Dr. MAS uses per-agent normalization statistics to prevent gradient explosion
vs. veRL: Dr. MAS adds native multi-agent orchestration and shared resource pooling for heterogeneous agents
vs. MAPPO [not cited in paper]: Dr. MAS avoids learning a centralized value function (critic), relying on group-based relative advantages instead

Limitations

No statistical significance tests reported
Evaluation limited to math reasoning and search tasks; broader agentic domains (coding, OS control) not tested
Relies on verifiable rewards, which may not be available for all open-ended multi-agent tasks

Reproducibility

Code: https://github.com/langfengQ/DrMAS

The paper specifies model versions (Qwen2.5 and Qwen3), datasets (DAPO-Math, NQ, HotpotQA), and group sizes. Learning rates and similar hyperparameters are described as configurable per agent, but exact values are not given in the main text.

📊 Experiments & Results

Evaluation Setup

Multi-agent interactions for Math (Solver+Verifier) and Search (Verifier+Search+Answer).

Benchmarks:

AIME 2024 / 2025 (Math Reasoning)
MATH500 (Math Reasoning)
HotpotQA (Multi-hop Search/QA)
NQ (Natural Questions) (Single-hop Search/QA)

Metrics:

avg@16 (Average accuracy over 16 samples per problem)
pass@16 (Probability that at least one of the 16 samples is correct)
Statistical methodology: Not explicitly reported in the paper
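The two metrics above can be computed from a matrix of per-sample correctness. The sketch assumes exactly k samples are drawn per problem and that pass@k is the empirical fraction of problems with at least one correct sample. Whether the paper uses this estimator or another one is not stated in this summary.

```python
import numpy as np


def avg_at_k(correct):
    """correct: (num_problems, k) array of 0/1 outcomes; mean over all samples."""
    return float(np.asarray(correct, dtype=float).mean())


def pass_at_k(correct):
    """Fraction of problems with at least one correct sample among the k."""
    return float(np.asarray(correct, dtype=bool).any(axis=1).mean())


# Toy example with 3 problems and k = 4 samples each.
correct = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(avg_at_k(correct), pass_at_k(correct))
```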

Key Results

Benchmark	Metric	Baseline (GRPO)	This Paper (Dr. MAS)	Δ (points)
Math reasoning results comparing Dr. MAS to GRPO with Qwen3 models in shared and non-shared settings.
AIME 24	avg@16	42.7	54.8	+12.1
Average (Math)	avg@16	57.5	61.1	+3.6
Search task results showing Dr. MAS prevents collapse in non-shared settings.
Average (Search)	avg@16	28.0	43.8	+15.8
Ablation study on normalization components using Qwen2.5-7B (Search Non-Sharing).
Average (Search)	avg@16	28.0	43.8	+15.8

Experiment Figures

Conceptual illustration of Gradient-Norm Inflation. Left: Global baseline leads to large gradients for agents far from mean. Right: Agent-wise baseline centers rewards, reducing gradient variance.

Gradient norm curves during training for Search agents (Verifier, Search, Answer).

Main Takeaways

Dr. MAS consistently improves performance over vanilla GRPO across both math and search domains, with the largest gain in unstable settings: +15.8 points avg@16 (28.0 to 43.8) for Qwen2.5-7B in non-shared search.
The method effectively eliminates gradient spikes observed in vanilla GRPO, leading to smoother training dynamics.
Non-shared model settings (independent weights per agent) are most vulnerable to GRPO instability; Dr. MAS is critical here.
Heterogeneous assignment (e.g., 7B Verifier + 3B Search/Answer) maintains performance comparable to all-7B homogeneous systems while reducing computational cost.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Group Relative Policy Optimization (GRPO)
Multi-Agent Systems (MAS)
LLM Post-training infrastructure

Key Terms

GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes rewards within a group of outputs for the same input to estimate advantages without a value function

Dr. MAS: The proposed method, a stable RL training recipe for multi-agent LLM systems that normalizes advantages per agent instead of globally

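gradient-norm inflation: A phenomenon where the magnitude of gradient updates becomes excessively large and destabilizes training; here it arises when an agent's rewards are normalized against a global baseline that lies far from that agent's own reward mean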

heterogeneous agent-model assignment: Assigning different LLM sizes or types to different agent roles (e.g., a 7B model for the Verifier and 3B models for the Search and Answer agents)

score function: The gradient of the log-probability of the policy, used in policy gradient algorithms

importance sampling ratio: The ratio between the probability of an action under the current policy versus the old policy, used to correct for off-policy data

micro-batches: Subsets of the training data processed separately to manage memory or compute constraints

vLLM/sglang: High-throughput inference engines for serving Large Language Models