GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages without a value function by normalizing each output's reward against the rewards of the other outputs sampled for the same input
Dr. MAS: The proposed method, a stable RL training recipe for multi-agent systems (MAS)
gradient-norm inflation: A phenomenon where the magnitude of gradient updates becomes excessively large, destabilizing training
heterogeneous agent-model assignment: Assigning different LLM sizes or types to different agent roles (e.g., a small model for drafting, a large model for verifying)
score function: The gradient, with respect to the policy parameters, of the log-probability the policy assigns to an action; the core term in policy gradient algorithms
importance sampling ratio: The ratio of an action's probability under the current policy to its probability under the old policy that generated the data, used to correct for off-policy data
micro-batches: Subsets of a training batch processed separately to fit memory or compute constraints
vLLM/SGLang: High-throughput inference engines for serving large language models
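
To make the GRPO and importance sampling ratio entries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used by Dr. MAS: the function names, the use of the population standard deviation, and the `eps` stabilizer are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of outputs sampled for the same input.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every output in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def importance_ratio(logp_new, logp_old):
    """Probability of an action under the current policy divided by its
    probability under the old policy, computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


# Four outputs for the same input, scored with a 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# An action whose log-probability rose from -1.2 to -1.0 after an update.
print(importance_ratio(-1.0, -1.2))
```

Working from log-probabilities rather than raw probabilities avoids underflow for long token sequences, which is why the ratio is written as an exponential of a difference.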