SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning

📝 Paper Summary

Multi-agent systems, Self-improving agents

SiriuS optimizes multi-agent collaboration by curating a library of successful interaction trajectories and repairing failed ones via ground-truth feedback to fine-tune agent policies.

Core Problem

Optimizing multi-agent systems is difficult because credit assignment across complex interactions is ambiguous, and acquiring specialized training data for diverse agents is challenging.

Why it matters:

Multi-agent systems often rely on fragile, manually designed prompts that do not generalize well.
Unlike single-agent settings, it is unclear how to attribute success or failure to specific intermediate steps in a collaborative dialogue.
Standard reinforcement learning methods struggle with the unstructured nature of language-based agent interactions.

Concrete Example: In a physics problem, a 'Physicist' agent might correctly identify a principle, but a 'Mathematician' agent might miscalculate the formula. A standard outcome reward simply says 'fail,' making it hard for the system to learn which specific agent needs improvement.

Key Novelty

Multi-Agent Experience Replay & Augmentation

Builds an 'experience library' by collecting successful reasoning trajectories from agent interactions, filtering by outcome reward.
Augments failed trajectories by having a critic with access to the ground truth guide agents in regenerating correct steps, converting failures into useful training data.
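The filtering step can be illustrated with a minimal Python sketch. The names `Trajectory`, `split_by_outcome`, and the default threshold `epsilon=0.5` are hypothetical choices for illustration, not identifiers or values from the paper, and the reward is assumed to be a single scalar per trajectory.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One multi-agent interaction: the question, each agent's message, and the outcome reward."""
    question: str
    messages: dict   # agent name -> that agent's output (illustrative layout)
    reward: float    # scalar outcome reward for the final answer


def split_by_outcome(trajectories, epsilon=0.5):
    """Split trajectories into (successes, failures) using an outcome-reward threshold."""
    successes = [t for t in trajectories if t.reward > epsilon]
    failures = [t for t in trajectories if t.reward <= epsilon]
    return successes, failures


sampled = [
    Trajectory("q1", {"physicist": "use F = ma", "mathematician": "a = 2"}, reward=1.0),
    Trajectory("q2", {"physicist": "use F = ma", "mathematician": "a = 7"}, reward=0.0),
]
library, to_repair = split_by_outcome(sampled)
```

In this sketch the successes go straight into the library, while the failures are set aside as candidates for augmentation.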

Architecture

The SiriuS training pipeline: Iterative loop of action sampling, evaluation, library update, and fine-tuning.
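The four stages of the loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `run_loop`, `toy_sample`, and `toy_finetune` are hypothetical names, the "policy" is a plain lookup table standing in for an LLM, and the repair of failed trajectories is left out to keep the loop short.

```python
def run_loop(questions, answers, sample, finetune, policy, iterations=3):
    """Iterate: sample outputs, score them, add successes to the library, fine-tune."""
    library = []
    for it in range(iterations):
        for q in questions:
            output = sample(policy, q, it)                   # action sampling
            reward = 1.0 if output == answers[q] else 0.0    # outcome evaluation
            if reward > 0 and (q, output) not in library:
                library.append((q, output))                  # library update
        policy = finetune(policy, library)                   # fine-tuning step
    return policy, library


# Toy stand-ins: untrained questions get a rotating guess, and "fine-tuning"
# simply memorises the library. A real system would call an LLM and run SFT.
def toy_sample(policy, question, iteration):
    return policy.get(question, str(iteration))


def toy_finetune(policy, library):
    return {**policy, **dict(library)}


answers = {"1+1": "2", "0+1": "1"}
policy, library = run_loop(list(answers), answers, toy_sample, toy_finetune, {}, iterations=3)
```

After three iterations the toy policy has absorbed every question it answered correctly at least once, which is the bootstrapping effect the loop is meant to produce.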

Evaluation Highlights

Boosts performance by 2.86% to 21.88% on reasoning and biomedical QA tasks compared to baselines.
Enhances agent negotiation capabilities in competitive settings (Resource Exchange, Seller/Buyer scenarios).

Breakthrough Assessment

7/10

Proposes a principled, self-contained loop for multi-agent improvement without requiring dense human supervision. The reported gains (up to 21.88%) are significant, though the method relies on ground-truth availability for the correction phase.

⚙️ Technical Details

Problem Definition

Setting: Multi-agent Markov Decision Process defined by tuple <S, A, T, R, N, G>

Inputs: Initial state s (problem statement) and communication graph G

Outputs: Joint actions a leading to a final solution or state

Pipeline Flow

Example Configuration (Science): Physicist Agent → Mathematician Agent → Summarizer Agent
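A sequential configuration like this one can be sketched in Python with stub agents in place of LLM calls. The function `run_pipeline`, the stub agents, and the sample question are hypothetical, and the sketch assumes each agent sees the question plus all earlier messages, which the summary does not specify.

```python
def run_pipeline(question, agents):
    """Run agents in order; each one receives the question and a copy of the earlier messages."""
    transcript = []
    for name, agent in agents:
        message = agent(question, list(transcript))
        transcript.append((name, message))
    return transcript


# Stub agents standing in for LLM calls.
def physicist(question, transcript):
    return "Principle: Newton's second law, F = m * a."


def mathematician(question, transcript):
    return "Computation: a = F / m = 10 / 2 = 5 m/s^2."


def summarizer(question, transcript):
    return "Final answer: 5 m/s^2"


transcript = run_pipeline(
    "A 10 N force acts on a 2 kg mass. What is its acceleration?",
    [("Physicist", physicist), ("Mathematician", mathematician), ("Summarizer", summarizer)],
)
final_answer = transcript[-1][1]
```

The resulting transcript is the trajectory that would later be scored by the outcome reward.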

System Modules

Physicist Agent

Analyzes the problem to extract domain-specific physical principles.

Model or implementation: LLM (Specific variant not reported in snippet)

Mathematician Agent

Formalizes the reasoning with quantitative models based on the Physicist's analysis.

Model or implementation: LLM (Specific variant not reported in snippet)

Summarizer Agent

Consolidates insights into a clear final answer.

Model or implementation: LLM (Specific variant not reported in snippet)

Novel Architectural Elements

Trajectory Augmentation Loop: Failed trajectories are not discarded but repaired via an external feedback mechanism (grounded in ground truth) to create 'corrected' synthetic data for SFT.
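One way to picture the repair step is the sketch below. All names (`repair_trajectory`, `toy_critic`, `toy_regenerate`, `toy_is_correct`) are hypothetical, the callables stand in for LLM calls, and the rule of keeping a regenerated output only when it now matches the ground truth is an assumption made for this illustration.

```python
def repair_trajectory(question, ground_truth, failed_output, critic, regenerate, is_correct):
    """Try to turn a failed attempt into a usable training example.

    Returns the regenerated output if it is now correct, otherwise None.
    """
    feedback = critic(question, failed_output, ground_truth)  # critic may consult the ground truth
    revised = regenerate(question, failed_output, feedback)   # agent retries using the feedback
    return revised if is_correct(revised, ground_truth) else None


# Stubs standing in for LLM calls.
def toy_critic(question, output, ground_truth):
    return "The arithmetic is off; recheck the division."


def toy_regenerate(question, output, feedback):
    return "a = 10 / 2 = 5"


def toy_is_correct(output, ground_truth):
    return output.strip().endswith(ground_truth)


repaired = repair_trajectory(
    "A 10 N force acts on a 2 kg mass. Find a.", "5", "a = 10 / 2 = 4",
    toy_critic, toy_regenerate, toy_is_correct,
)
```

Re-checking the regenerated output before accepting it keeps unrepaired failures out of the training data.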

Modeling

Base Model: LLMs (Specific model family/size not reported in text snippet)

Training Method: Iterative Supervised Fine-Tuning (SFT) on self-generated 'Experience Library'

Training Data:

Successful trajectories (Reward > epsilon) added directly to library.
Failed trajectories repaired via feedback generation and regeneration, then added to library.
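To make the training-data step concrete, the sketch below turns library entries into prompt/completion pairs grouped by agent role. The record layout and the function `to_sft_examples` are hypothetical. Keeping a separate dataset per role is also an assumption, since the summary does not say whether agents share weights.

```python
def to_sft_examples(library):
    """Group library entries into per-agent lists of prompt/completion pairs."""
    per_agent = {}
    for entry in library:
        context = entry["question"]
        for agent_name, message in entry["messages"]:
            per_agent.setdefault(agent_name, []).append(
                {"prompt": context, "completion": message}
            )
            # Later agents are assumed to see the earlier turns as part of their prompt.
            context = context + "\n" + agent_name + ": " + message
    return per_agent


library = [
    {
        "question": "Find a for F = 10 N, m = 2 kg.",
        "messages": [("Physicist", "Use F = m * a."), ("Mathematician", "a = 5 m/s^2.")],
    }
]
datasets = to_sft_examples(library)
```

Each role's list could then be passed to whatever SFT tooling is in use.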

Compute: Not reported in the paper

Comparison to Prior Work

vs. STaR: SiriuS extends bootstrapping to multi-agent graphs where credit assignment is ambiguous, rather than single-chain reasoning.
vs. Standard Multi-Agent Debate: SiriuS explicitly fine-tunes the agents on successful debate trajectories, rather than just using debate at inference time.

Limitations

Relies on ground truth availability for the trajectory augmentation (repair) phase.
Credit assignment remains implicit; the system optimizes based on joint success rather than explicit per-agent rewards.
Specific improvements depend heavily on the quality of the 'critic' used for augmentation.

Reproducibility

Code availability is mentioned ('available here') but the URL is missing from the provided text snippet. Model hyperparameters and base model identities are not in the text snippet.

📊 Experiments & Results

Evaluation Setup

Multi-agent collaboration on reasoning and negotiation tasks.

Benchmarks:

Reasoning QA: complex reasoning (specific datasets not named in snippet, likely MATH/GSM8K)
Biomedical QA: domain-specific QA
Competitive Settings: negotiation (Resource Exchange, Seller/Buyer, Ultimatum Game)

Metrics:

Accuracy
Win rate / Individual Reward (in competitive settings)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

SiriuS consistently improves multi-agent performance over single-agent and baseline methods, with gains ranging from 2.86% to 21.88%.
The framework trains on both successful trajectories and failed trajectories that have been repaired, so failures contribute supervised training signal instead of being discarded.
The approach generalizes to competitive settings where agents must balance cooperation and competition (e.g., negotiation).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, policies, rewards)
Supervised Fine-Tuning (SFT)
Language Models (LLMs)

Key Terms

SFT: Supervised Fine-Tuning—retraining a pre-trained model on a specific dataset to adapt its behavior.

Bootstrapping: A self-improving process where the model uses its own outputs that pass a quality filter (here, the outcome reward) as training data to get better.

Trajectory: The sequence of states and actions (messages/reasoning steps) generated by agents during a problem-solving session.

Credit Assignment: The problem of determining which past action or agent is responsible for a final positive or negative outcome.

Experience Library: A repository of high-quality reasoning trajectories collected from successful agent interactions, used for training.

STaR: Self-Taught Reasoner—a baseline method that iteratively trains a single agent on its own correct reasoning chains.