Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, James Zou
arXiv.org
(2025)
Agent, Reasoning, RL
📝 Paper Summary
Multi-agent systems, Self-improving agents
SiriuS optimizes multi-agent collaboration by curating a library of successful interaction trajectories and repairing failed ones via ground-truth feedback to fine-tune agent policies.
Core Problem
Optimizing multi-agent systems is difficult because credit assignment across complex interactions is ambiguous, and acquiring specialized training data for diverse agents is challenging.
Why it matters:
Multi-agent systems often rely on fragile, manually designed prompts that do not generalize well.
Unlike single-agent settings, it is unclear how to attribute success or failure to specific intermediate steps in a collaborative dialogue.
Standard reinforcement learning methods struggle with the unstructured nature of language-based agent interactions.
Concrete Example: In a physics problem, a 'Physicist' agent might correctly identify a principle, but a 'Mathematician' agent might miscalculate the formula. A standard outcome reward simply says 'fail', making it hard for the system to learn which specific agent needs improvement.
Key Novelty
Multi-Agent Experience Replay & Augmentation
Builds an 'experience library' by collecting successful reasoning trajectories from agent interactions, filtering by outcome reward.
Augments failed trajectories by using a ground-truth grounded critic to guide agents in regenerating correct steps, converting failures into useful training data.
Architecture
The SiriuS training pipeline: Iterative loop of action sampling, evaluation, library update, and fine-tuning.
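The four stages of the loop can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample`, `evaluate`, and `fine_tune` are hypothetical callables supplied by the caller, the library is assumed to accumulate across iterations, and the repair of failed trajectories is left out to keep the skeleton short.

```python
from typing import Any, Callable, List, Tuple

# A trajectory is modeled here as a list of (agent_name, message) pairs (assumption).
Trajectory = List[Tuple[str, str]]


def run_training_loop(
    problems: List[Tuple[str, str]],              # (problem, ground-truth answer)
    policy: Any,                                  # stand-in for the agents' weights
    sample: Callable[[Any, str], Trajectory],     # hypothetical: roll out the agents
    evaluate: Callable[[Trajectory, str], float], # hypothetical: outcome reward
    fine_tune: Callable[[Any, list], Any],        # hypothetical: SFT on the library
    iterations: int = 3,
    epsilon: float = 0.5,
):
    library = []  # assumed to persist and grow across iterations
    for _ in range(iterations):
        for problem, answer in problems:
            trajectory = sample(policy, problem)       # 1. action sampling
            reward = evaluate(trajectory, answer)      # 2. evaluation
            if reward > epsilon:
                library.append((problem, trajectory))  # 3. library update
        policy = fine_tune(policy, library)            # 4. fine-tuning
    return policy, library
```

Passing the stages in as callables keeps the sketch independent of any particular LLM or training framework.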
Evaluation Highlights
Boosts performance by 2.86% to 21.88% on reasoning and biomedical QA tasks compared to baselines.
Proposes a principled, self-contained loop for multi-agent improvement without requiring dense human supervision. The reported gains (up to 21.88%) are significant, though the method relies on ground-truth availability for the correction phase.
⚙️ Technical Details
Problem Definition
Setting: Multi-agent Markov Decision Process defined by tuple <S, A, T, R, N, G>
Inputs: Initial state s (problem statement) and communication graph G
Outputs: Joint actions a leading to a final solution or state
Physicist Agent
Analyzes the problem to extract domain-specific physical principles.
Model or implementation: LLM (Specific variant not reported in snippet)
Mathematician Agent
Formalizes the reasoning with quantitative models based on the Physicist's analysis.
Model or implementation: LLM (Specific variant not reported in snippet)
Summarizer Agent
Consolidates insights into a clear final answer.
Model or implementation: LLM (Specific variant not reported in snippet)
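The hand-off between the Physicist, Mathematician, and Summarizer roles can be illustrated with a small sketch. It assumes a simple sequential chain as the communication graph and uses hard-coded stub functions in place of LLM calls; `run_pipeline` and the three stubs are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]                    # (agent_name, text)
Agent = Callable[[str, List[Message]], str]  # (problem, history) -> text


def run_pipeline(problem: str, agents: List[Tuple[str, Agent]]) -> List[Message]:
    """Call each agent in order; each one sees the problem and all prior messages."""
    trajectory: List[Message] = []
    for name, agent in agents:
        trajectory.append((name, agent(problem, trajectory)))
    return trajectory


# Stub agents standing in for LLM calls (illustrative only).
def physicist(problem: str, history: List[Message]) -> str:
    return "principle: conservation of energy"


def mathematician(problem: str, history: List[Message]) -> str:
    # Builds on the Physicist's analysis, the most recent message in the history.
    return "using [" + history[-1][1] + "] -> v = sqrt(2*g*h)"


def summarizer(problem: str, history: List[Message]) -> str:
    return "final: " + history[-1][1].split("-> ")[1]


trajectory = run_pipeline(
    "Speed of a ball dropped from height h?",
    [("physicist", physicist), ("mathematician", mathematician), ("summarizer", summarizer)],
)
```

The returned list of `(agent_name, text)` pairs is the kind of trajectory that would later be scored by an outcome reward.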
Novel Architectural Elements
Trajectory Augmentation Loop: Failed trajectories are not discarded but repaired via an external feedback mechanism (grounded in ground truth) to create 'corrected' synthetic data for SFT.
Modeling
Base Model: LLMs (Specific model family/size not reported in text snippet)
Training Method: Iterative Supervised Fine-Tuning (SFT) on self-generated 'Experience Library'
Training Data:
Successful trajectories (Reward > epsilon) added directly to library.
Failed trajectories repaired via feedback generation and regeneration, then added to library.
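A hedged sketch of how the two data sources above might be combined into one library follows. `reward_fn` and `repair` are hypothetical callables; re-checking a repaired trajectory against the reward threshold before adding it is an assumption of this sketch, not something the summary states.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[Tuple[str, str]]  # (agent_name, message) pairs


def build_library(
    rollouts: List[Tuple[str, str, Trajectory]],  # (problem, ground truth, trajectory)
    reward_fn: Callable[[Trajectory, str], float],
    repair: Callable[[str, str, Trajectory], Optional[Trajectory]],
    epsilon: float = 0.5,
) -> List[Tuple[str, Trajectory]]:
    library = []
    for problem, answer, trajectory in rollouts:
        if reward_fn(trajectory, answer) > epsilon:
            # Successful trajectory: added directly.
            library.append((problem, trajectory))
            continue
        # Failed trajectory: feedback generation and regeneration happen inside
        # `repair`, which may use the ground-truth answer and may return None.
        repaired = repair(problem, answer, trajectory)
        # Assumption: only keep repairs that now clear the reward threshold.
        if repaired is not None and reward_fn(repaired, answer) > epsilon:
            library.append((problem, repaired))
    return library
```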
Compute: Not reported in the paper
Comparison to Prior Work
vs. STaR: SiriuS extends bootstrapping to multi-agent graphs where credit assignment is ambiguous, rather than single-chain reasoning.
vs. Standard Multi-Agent Debate: SiriuS explicitly fine-tunes the agents on successful debate trajectories, rather than just using debate at inference time.
Limitations
Relies on ground truth availability for the trajectory augmentation (repair) phase.
Credit assignment remains implicit; the system optimizes based on joint success rather than explicit per-agent rewards.
Specific improvements depend heavily on the quality of the 'critic' used for augmentation.
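To make the implicit credit assignment concrete, the sketch below shows one plausible way to turn library trajectories into per-agent fine-tuning examples. Every agent's message in a kept trajectory becomes a training target for that agent, so all agents share the same joint outcome signal. The record layout (`problem`, `context`, `target`) is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trajectory = List[Tuple[str, str]]  # (agent_name, message) pairs


def per_agent_examples(library: List[Tuple[str, Trajectory]]) -> Dict[str, list]:
    """Split joint trajectories into one SFT dataset per agent role."""
    datasets = defaultdict(list)
    for problem, trajectory in library:
        for i, (agent, message) in enumerate(trajectory):
            datasets[agent].append({
                "problem": problem,
                "context": trajectory[:i],  # messages this agent saw before acting
                "target": message,          # what this agent is trained to produce
            })
    return dict(datasets)
```

No per-agent reward appears anywhere in this function: whether a step is used for training depends only on whether the whole trajectory made it into the library.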
Reproducibility
Code availability is mentioned ('available here') but the URL is missing from the provided text snippet. Model hyperparameters and base model identities are not in the text snippet.
📊 Experiments & Results
Evaluation Setup
Multi-agent collaboration on reasoning and negotiation tasks.
Benchmarks:
Reasoning QA: complex reasoning (specific datasets not named in snippet)
Win rate / Individual Reward (in competitive settings)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
SiriuS consistently improves multi-agent performance over single-agent and baseline methods, with gains ranging from 2.86% to 21.88%.
The framework trains agents on both successful trajectories and failed trajectories that have been corrected with ground-truth-guided feedback.
The approach generalizes to competitive settings where agents must balance cooperation and competition (e.g., negotiation).
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (MDPs, policies, rewards)
Supervised Fine-Tuning (SFT)
Language Models (LLMs)
Key Terms
SFT: Supervised Fine-Tuning, i.e., further training a pre-trained model on a specific dataset of input-output examples to adapt its behavior.
Bootstrapping: A self-improving process where the model uses its own outputs that pass a quality filter (here, reaching a correct final answer) as training data.
Trajectory: The sequence of states and actions (messages/reasoning steps) generated by agents during a problem-solving session.
Credit Assignment: The problem of determining which past action or agent is responsible for a final positive or negative outcome.
Experience Library: A repository of high-quality reasoning trajectories collected from successful agent interactions, used for training.
STaR: Self-Taught Reasoner, a baseline method that iteratively trains a single agent on its own correct reasoning chains.