Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

📝 Paper Summary

Multi-call tool use with flexible plan
RL-based tool use

SWiRL improves LLM performance on complex tasks by generating multi-step tool-use trajectories, filtering them for step-wise soundness, and optimizing the model via reinforcement learning on granular intermediate steps.

Core Problem

Traditional RLHF and RLAIF optimize for single-step responses, failing to address the compounding errors and complex planning required for multi-step reasoning and tool-use tasks.

Why it matters:

Complex real-world problems require sequences of interrelated actions (searching, synthesizing, calculating), where one early mistake derails the final answer
Current methods struggle to teach models when to stop searching or how to recover from intermediate errors in long trajectories
Existing single-step optimization approaches miss the granular feedback needed to correct specific faulty reasoning steps or tool calls

Concrete Example: In multi-hop QA, a model might incorrectly query a search engine in step 1, retrieve irrelevant info, and hallucinate an answer. Standard outcome-based RL only penalizes the final wrong answer, failing to teach the model *which* specific step (the bad query) caused the failure.

Key Novelty

Step-Wise Reinforcement Learning (SWiRL)

Decomposes multi-step synthetic trajectories into sub-trajectories, treating each intermediate action (reasoning or tool call) as a distinct training point
Uses a generative reward model to score the 'reasonableness' of each specific step given its context, rather than just scoring the final outcome
Demonstrates that filtering data based on step-wise process quality is more effective for RL than filtering solely for correct final answers

Architecture

The Step-Wise Reinforcement Learning (SWiRL) optimization process.

Evaluation Highlights

+21.5% relative accuracy improvement on GSM8K (math) and +12.3% on HotPotQA (multi-hop QA) compared to baseline approaches
Training solely on HotPotQA (text QA) improves zero-shot performance on GSM8K (math) by a relative 16.9%, demonstrating strong cross-task generalization
Outperforms baselines by a relative 15.3% on BeerQA and 11.1% on MuSiQue, confirming effectiveness across diverse multi-hop reasoning datasets

Breakthrough Assessment

8/10

Significant gains in multi-step reasoning and remarkable cross-task generalization (QA to Math). The finding that process-filtered data outperforms outcome-filtered data for RL challenges prevailing assumptions in synthetic data distillation.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning and tool use where a model generates a trajectory of states and actions to solve a query

Inputs: Original prompt/question q

Outputs: Sequence of actions a_1...a_K (reasoning thoughts, tool calls, final answer)

Pipeline Flow

Inference Loop: Prompt → Model Generation → Action Parsing → Tool Execution → Context Update → Repeat
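
The loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the paired `<tag>...</tag>` syntax, the `answer` tag, the `<result>` wrapper, the `max_steps` budget, and the scripted policy and stub calculator are all assumptions introduced here so the example runs without a model or external API.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Assumed tag syntax: <tag>content</tag>. The summary only names the tool tags.
TOOL_TAGS = ("search_query", "math_exp")
ANSWER_TAG = "answer"  # hypothetical tag marking the final answer


def parse_action(text: str) -> Tuple[str, str]:
    """Return (kind, content); kind is a tool tag, 'answer', or 'none'."""
    for tag in TOOL_TAGS + (ANSWER_TAG,):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match:
            return tag, match.group(1).strip()
    return "none", ""


def run_episode(
    question: str,
    policy: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Run the generate -> parse -> execute -> update-context loop."""
    context = question
    actions: List[str] = []
    for _ in range(max_steps):
        output = policy(context)                # model generation
        actions.append(output)
        kind, content = parse_action(output)    # action parsing
        if kind == ANSWER_TAG:
            return content, actions
        if kind in tools:
            result = tools[kind](content)       # tool execution
            context += f"\n{output}\n<result>{result}</result>"  # context update
        else:
            context += f"\n{output}"            # plain reasoning step
    return None, actions  # step budget exhausted without a final answer


def scripted_policy(context: str) -> str:
    """Stand-in for the language model: one tool call, then an answer."""
    if "<result>" not in context:
        return "I should compute this. <math_exp>6 * 7</math_exp>"
    return "The tool returned 42. <answer>42</answer>"


answer, actions = run_episode(
    "What is 6 times 7?",
    scripted_policy,
    {"math_exp": lambda expr: "42"},  # stub calculator, ignores its input
)
```

Keeping the policy and tools as injected callables means the same loop can drive trajectory generation and evaluation; only the callables change.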

System Modules

Base Model (Policy)

Generates reasoning trace and decides next action (tool call or final answer)

Model or implementation: Gemma-2-27b-it (also evaluated 2b and 9b)

Action Parser (Execution)

Detects tool tags (<<<search_query>>> or <<<math_exp>>>) and extracts content

Model or implementation: Rule-based parser

Tool Executor (Execution)

Executes external calls (Search or Calculator) and returns results

Model or implementation: External APIs (Gecko-based retriever or SymPy interpreter)

Novel Architectural Elements

Granular sub-trajectory decomposition for training: The trajectory is broken into K independent training samples (s_i, a_i) where s_i includes full history, allowing step-specific optimization rather than whole-trajectory scoring
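
As a rough sketch of this decomposition, the function below turns one K-step trajectory into K (s_i, a_i) pairs, where each s_i is the question plus everything that happened before step i. The `Step` container, the newline-joined history format, and the example trajectory are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    action: str       # a_i: reasoning text, tool call, or final answer
    observation: str  # tool result appended after the action ("" if none)


def decompose(question: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Split a K-step trajectory into K (s_i, a_i) training samples.

    s_i is the question plus all earlier actions and observations.
    """
    samples = []
    history = question
    for step in steps:
        samples.append((history, step.action))
        history += "\n" + step.action
        if step.observation:
            history += "\n" + step.observation
    return samples


# Made-up trajectory, for illustration only.
trajectory = [
    Step("search: capital of France", "Paris is the capital of France."),
    Step("search: population of Paris", "About 2.1 million."),
    Step("final answer: about 2.1 million", ""),
]
samples = decompose("How many people live in the capital of France?", trajectory)
```

Each resulting pair can be scored and optimized on its own, which is what makes step-specific feedback possible.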

Modeling

Base Model: Gemma-2 (2B, 9B, and 27B variants)

Training Method: Step-Wise Reinforcement Learning (SWiRL)

Objective Functions:

Purpose: Maximize expected sum of step-wise rewards.

Formally: $J(\theta) = \mathbb{E}\left[\sum_{(s,a) \in T} R(a \mid s)\right]$
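
To make the objective concrete, the sketch below estimates it from sampled trajectories and computes a reward-weighted log-likelihood surrogate whose gradient is the standard REINFORCE estimate. The exact estimator, any baseline or KL term, and the reward scale used in the paper are not given in this summary, so the record format and the numbers here are illustrative assumptions.

```python
from typing import List, Tuple

# One step record: (log-probability of a_i under the policy, reward R(a_i | s_i)).
StepRecord = Tuple[float, float]


def estimate_objective(trajectories: List[List[StepRecord]]) -> float:
    """Monte Carlo estimate of J: mean over trajectories of summed step rewards."""
    return sum(sum(r for _, r in traj) for traj in trajectories) / len(trajectories)


def surrogate_loss(trajectories: List[List[StepRecord]]) -> float:
    """Negative reward-weighted log-likelihood.

    With rewards treated as constants, the gradient of this quantity with
    respect to the policy parameters is the negated REINFORCE estimate.
    """
    total = sum(r * logp for traj in trajectories for logp, r in traj)
    return -total / len(trajectories)


# Illustrative values only: two trajectories with two steps each.
batch = [
    [(-0.5, 1.0), (-1.0, 0.0)],    # second step judged unreasonable
    [(-0.25, 1.0), (-0.25, 1.0)],  # both steps judged reasonable
]
objective = estimate_objective(batch)
loss = surrogate_loss(batch)
```

Steps with zero reward contribute nothing to the loss, so only steps the reward model accepts pull the policy toward their actions.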

Training Data:

50,000 synthetic trajectories from HotPotQA (5 per question)
37,500 synthetic trajectories from GSM8K
Filtered via 'Process filtering' using Gemini 1.5 Pro as judge
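
The difference between process filtering and outcome filtering can be sketched as follows. This assumes a trajectory is kept only when the judge accepts every step; the keyword-based `stub_judge` is a hypothetical stand-in for the Gemini 1.5 Pro judge, and the example trajectories are made up.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (context s_i, action a_i)
Judge = Callable[[str, str], bool]


def process_filter(trajectory: List[Sample], judge: Judge) -> bool:
    """Keep a trajectory only if the judge accepts every step in its context."""
    return all(judge(context, action) for context, action in trajectory)


def outcome_filter(final_answer: str, gold_answer: str) -> bool:
    """Keep a trajectory only if its final answer matches the gold answer."""
    return final_answer.strip().lower() == gold_answer.strip().lower()


def stub_judge(context: str, action: str) -> bool:
    """Toy rule standing in for an LLM judge; not the paper's prompt."""
    return bool(action.strip()) and "UNSUPPORTED" not in action


# Every step is sensible given what was retrieved, but the answer is wrong.
sound_but_wrong = [
    ("Q: Who wrote Dune?", "search: Dune novel author"),
    (
        "Q: Who wrote Dune?\nsearch: Dune novel author\nresult: (no documents found)",
        "final answer: unknown",
    ),
]
# A step that asserts something without support.
flawed = [("Q: Who wrote Dune?", "UNSUPPORTED guess: Isaac Asimov")]

kept_by_process = process_filter(sound_but_wrong, stub_judge)
kept_by_outcome = outcome_filter("unknown", "Frank Herbert")
flawed_kept = process_filter(flawed, stub_judge)
```

The first trajectory survives process filtering but not outcome filtering, which is the kind of sample that only a process-filtered training set retains.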

Key Hyperparameters:

learning_rate: 1e-6 (Gemma-2-9b), 5e-7 (Gemma-2-27b)
batch_size: 32 (Gemma-2-9b), 16 (Gemma-2-27b)
kl_penalty: Not explicitly reported in the paper
steps: 200-300

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1/STaR: SWiRL uses process-based filtering (step correctness) rather than outcome-based filtering (final answer correctness), finding the former superior for RL
vs. RLHF/RLAIF: Optimizes at the step level (intermediate actions) rather than the trajectory level
vs. DQO: Optimizes step-level actions rather than token-level actions
vs. OREO: Does not require training a separate value network or iterative co-optimization

Limitations

Relies on a strong proprietary model (Gemini 1.5 Pro) for data generation and process reward labeling
Process filtering is computationally expensive compared to outcome filtering
Smaller models (2B/9B) show limited generalization compared to the 27B model
No direct comparison to online RL methods like PPO with process rewards

Reproducibility

Not provided: Code URL, specific prompt templates for the reward model, or trained weights.
Available: Base models (Gemma-2) and datasets (HotPotQA, GSM8K) are public.
Closed-source dependency: Uses Gemini 1.5 Pro for data generation and reward modeling.

📊 Experiments & Results

Evaluation Setup

Multi-step QA and Math reasoning with tool access (Search, Calculator)

Benchmarks:

HotPotQA (Multi-hop Question Answering)
GSM8K (Mathematical Reasoning)
BeerQA (Multi-hop QA)
MuSiQue (Multi-hop QA)
CofCA (Counterfactual Multi-hop QA)

Metrics:

Accuracy (Exact Match or equivalent)
Statistical methodology: Not explicitly reported in the paper

Key Results

SWiRL outperforms standard Supervised Finetuning (SFT) and Base models across multiple datasets when trained on in-domain data.

Benchmark	Metric	Baseline	This Paper	Δ (points)
HotPotQA	Accuracy	57.3	64.4	+7.1
GSM8K	Accuracy	73.3	76.4	+3.1

Cross-task generalization experiments show that training on one domain (e.g., QA) improves performance on distinct domains (e.g., Math).

Benchmark	Metric	Baseline	This Paper	Δ (points)
GSM8K	Accuracy	62.9	73.5	+10.6
HotPotQA	Accuracy	51.1	55.8	+4.7

Ablation on filtering strategies reveals that Process Filtering (step-wise soundness) is superior to Outcome Filtering (final answer correctness) for SWiRL.

Benchmark	Metric	Baseline	This Paper	Δ
HotPotQA	Accuracy	57.3	64.4	+7.1
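
The Δ column reports absolute differences in accuracy points, while the headline figures in this summary are relative improvements. The short calculation below converts the table rows to both forms. The row labels are taken from the table captions; the headline percentages may have been computed from unrounded scores or against different baselines, so an exact match for every row should not be assumed.

```python
from typing import Tuple

# (label, baseline accuracy, SWiRL accuracy), copied from the Key Results table.
rows = [
    ("HotPotQA, in-domain", 57.3, 64.4),
    ("GSM8K, in-domain", 73.3, 76.4),
    ("GSM8K, cross-task", 62.9, 73.5),
    ("HotPotQA, cross-task", 51.1, 55.8),
]


def gains(baseline: float, score: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = score - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for label, baseline, score in rows:
    absolute, relative = gains(baseline, score)
    print(f"{label}: +{absolute} points, +{relative}% relative")
```

For the cross-task GSM8K row, a gain of 10.6 points over a baseline of 62.9 works out to about 16.9% relative.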

Experiment Figures

Performance on GSM8K (Math) when training on GSM8K (In-Domain) vs. training on HotPotQA (Out-of-Domain), compared to Base and SFT.

Comparison of filtering strategies (No Filter, Outcome Filter, Process Filter) for SFT vs. SWiRL on HotPotQA.

Main Takeaways

Process filtering is critical for RL: Models learn best from trajectories with sound reasoning steps, even if the final outcome is incorrect. This contrasts with SFT, which performs best when trained on trajectories with correct final answers.
Strong cross-task generalization: Learning granular reasoning steps transfers between disparate tasks (e.g., Math to QA), suggesting the model learns a general 'how to reason' capability rather than just task-specific patterns.
SWiRL generalizes to out-of-distribution datasets: Training on HotPotQA yields double-digit relative gains on BeerQA (+15.3%), MuSiQue (+11.1%), and CofCA (+14.8%).
Model scale matters for generalization: While smaller models (2B/9B) improve on in-domain tasks, only the larger 27B model shows strong cross-domain transfer capabilities.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy gradient, reward modeling)
Large Language Models (LLMs) and tool use/function calling
Synthetic data generation and filtering strategies

Key Terms

SWiRL: Step-Wise Reinforcement Learning—the proposed method of optimizing models on granular sub-trajectories using step-level rewards

Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

Process Reward Model (PRM): A reward model that evaluates the quality of intermediate reasoning steps, rather than just the final answer

Outcome Reward Model (ORM): A reward model that evaluates only the correctness of the final answer

SFT: Supervised Fine-Tuning—training a model on labeled examples using standard log-likelihood maximization

Gecko: A text embedding model used for retrieval in the tool-use pipeline

SymPy: A Python library for symbolic mathematics, used as the calculator tool

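REINFORCE: A basic policy gradient RL algorithm that optimizes the expected reward by weighting the log-likelihood gradient of each sampled action by its reward; it is a simpler relative of later policy gradient methods such as PPO that are often used with LLMs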