RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using RL where the reward is determined by a deterministic check of the final answer (e.g., math problems).
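A minimal sketch of such a verifiable reward, assuming the model is prompted to end its output with an "Answer:" line (the extraction convention and function names are illustrative, not the paper's):

```python
def extract_final_answer(completion: str) -> str:
    # Assumed convention: the model ends with a line "Answer: <value>".
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches exactly, else 0.0.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(verifiable_reward("Step 1: 2+2=4\nAnswer: 4", "4"))  # 1.0
```

In practice the string comparison is replaced by a more robust equivalence check (see Math-Verify below), but the reward remains a deterministic 0/1 signal.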
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, eliminating the need for a separate value function critic.
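The group-relative advantage at the heart of GRPO can be sketched as follows (a simplified standalone version; the full algorithm also includes clipped policy-gradient updates):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style advantage: normalize each sampled output's reward by the
    # mean and standard deviation of its group (all outputs for the same
    # prompt), replacing a learned value-function critic.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs tied (all correct or all wrong): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```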
SwS: Self-aware Weakness-driven Problem Synthesis—the proposed framework.
Failure Rate: A metric used to identify model weaknesses; a problem is flagged as a failure if the model's accuracy on it never reaches 50% and shows a negative slope (a downward trend) across training epochs.
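A hedged sketch of such a failure filter, assuming per-epoch accuracies are recorded for each problem and the trend is measured by a least-squares slope (these implementation details are assumptions, not the paper's exact procedure):

```python
def is_failure(accuracies: list[float], threshold: float = 0.5) -> bool:
    # A problem counts as a failure if accuracy never reaches the threshold
    # AND the accuracy trend over epochs is negative.
    n = len(accuracies)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(accuracies) / n
    # Least-squares slope of accuracy vs. epoch index.
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den if den else 0.0
    return max(accuracies) < threshold and slope < 0

print(is_failure([0.4, 0.3, 0.2]))  # True: always below 50%, trending down
```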
Concept Recombination: The process of taking keywords/topics from failed problems (e.g., 'geometry', 'area') and combining them to prompt a generator for new questions.
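A sketch of the recombination step, assuming failed-problem keywords have already been extracted; the prompt template here is a hypothetical stand-in, not the paper's actual generator prompt:

```python
import random

def recombine_concepts(failed_keywords: list[str], k: int = 2,
                       n_prompts: int = 3, seed: int = 0) -> list[str]:
    # Sample k-keyword combinations from concepts the model failed on and
    # format them into prompts for a problem-generator model.
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    prompts = []
    for _ in range(n_prompts):
        combo = rng.sample(failed_keywords, k)
        prompts.append(
            "Write a new math problem combining the concepts: " + ", ".join(combo)
        )
    return prompts

for p in recombine_concepts(["geometry", "area", "ratio", "probability"]):
    print(p)
```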
Math-Verify: A library used to rigorously check whether a generated math answer matches the ground truth, handling equivalent representations (e.g., fractions vs. decimals).
Self-consistency: A method in which a model generates multiple answers to the same question, and the most frequent final answer is selected as the pseudo-label.
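The majority-vote step of self-consistency reduces to a small sketch (assuming final answers have already been extracted from the sampled completions):

```python
from collections import Counter

def self_consistency_label(answers: list[str]) -> tuple[str, float]:
    # Majority vote over sampled final answers: the most frequent answer
    # becomes the pseudo-label; also return its vote share as a rough
    # confidence signal. Ties resolve by insertion order (Counter behavior).
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

print(self_consistency_label(["42", "41", "42", "42", "40"]))  # ('42', 0.6)
```

The vote share can be used to discard low-agreement pseudo-labels before training, though whether this work applies such a filter is not stated here.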
Pass@K: A metric measuring the probability that at least one of K generated samples is correct.
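Pass@K is commonly computed with the standard unbiased estimator: draw k samples without replacement from n generations of which c are correct, giving 1 - C(n-c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k samples
    # drawn without replacement from n generations (c of them correct)
    # is correct.
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=2))  # 8/15 ≈ 0.533
```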
SFT: Supervised Fine-Tuning—training on labeled data before RL.
KL term: Kullback-Leibler divergence—a penalty term often added to the RL objective to keep the updated policy close to a reference policy; omitted in this paper's optimization.
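A common way to estimate this penalty from sampled sequences is the mean per-token log-probability gap between the two policies; a minimal sketch (the averaging scheme is an assumption, and this per-sample estimate can be negative even though the true KL is non-negative):

```python
def sequence_kl(logprobs_policy: list[float], logprobs_ref: list[float]) -> float:
    # Monte-Carlo KL estimate over one sampled sequence:
    # KL(pi || pi_ref) ~ mean_t [ log pi(a_t) - log pi_ref(a_t) ],
    # using the log-probs each policy assigns to the sampled tokens.
    gaps = [lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref)]
    return sum(gaps) / len(gaps)

print(sequence_kl([-1.0, -2.0], [-1.5, -2.5]))  # 0.5
```

Dropping this term, as done here, removes the pull toward the reference policy and relies on the verifiable reward alone to shape behavior.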