Jun Rao, Xuebo Liu, Hexuan Deng, Zepeng Lin, Zixiong Yu, Jiansheng Wei, Xiaojun Meng, Min Zhang
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China,
Huawei Noah’s Ark Lab
arXiv
(2025)
RL · Reasoning · Benchmark
📝 Paper Summary
Mathematical Reasoning · Data Selection for LLMs
SAI-DPO improves mathematical reasoning by iteratively selecting training data that aligns with the model's current self-aware difficulty and specifically targets knowledge points where the model is failing.
Core Problem
Existing data selection methods rely on static, external difficulty metrics that fail to adapt to a model's evolving capabilities and specific weaknesses during iterative training.
Why it matters:
Static metrics result in training on data that is either too easy (wasteful) or too hard (ineffective) as the model improves
Current approaches ignore the specific knowledge gaps of the model, treating all errors equally rather than targeting structural weaknesses
Training reasoning models is resource-intensive; improving data efficiency is critical for developing powerful models with constrained resources
Concrete Example: A model might be proficient in algebra but weak in geometry. A static sampler keeps feeding it algebra problems it has already mastered, while a difficulty-blind sampler might feed it geometry problems that are impossibly hard, rather than those on the 'frontier' of its capability.
Key Novelty
Self-Aware Iterative Direct Preference Optimization (SAI-DPO)
defines 'Self-Aware Difficulty' using the model's own performance (Pass@K) and generation characteristics (step count, length) rather than external labels
Uses 'Knowledge Points Similarity' to cluster questions and dynamically up-weight clusters where the model currently fails, ensuring training focuses on active weaknesses
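The exact scoring formula is not given in this summary; as an illustration only, a self-aware difficulty score might combine the model's empirical pass rate with the length of its own solutions. The equal weighting and step normalization below are assumptions, not the paper's formula:

```python
# Illustrative sketch of a self-aware difficulty score. The weighting and
# normalization are assumptions; the paper's exact formula may differ.

def self_aware_difficulty(num_correct, num_samples, avg_steps, max_steps=32):
    """Higher score = harder for the *current* model."""
    pass_rate = num_correct / num_samples          # empirical pass rate over K samples
    failure = 1.0 - pass_rate                      # how often the model fails
    effort = min(avg_steps / max_steps, 1.0)       # normalized reasoning length
    return 0.5 * failure + 0.5 * effort            # illustrative equal weighting

# A problem solved 2/8 times with long solutions scores as hard:
score = self_aware_difficulty(num_correct=2, num_samples=8, avg_steps=24)
```

Because the score is computed from the model's own rollouts, it shifts automatically as training progresses, which is the property static external labels lack.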
Architecture
Overview of the SAI-DPO pipeline, illustrating the offline data preparation and the online iterative loop of probing, dynamic sampling, and training.
Evaluation Highlights
Achieves an average performance boost of up to 21.3 percentage points across 8 mathematical reasoning benchmarks
+15 percentage points improvement on AMC23 (American Mathematics Competitions) compared to baselines
+10 percentage points improvement on AIME24 (American Invitational Mathematics Examination) compared to baselines
Breakthrough Assessment
7/10
Strong empirical gains (up to +21.3 points on average) on hard math benchmarks suggest the dynamic sampling strategy is highly effective, though the core components (DPO, clustering, Pass@K) are established techniques combined in a novel loop.
⚙️ Technical Details
Problem Definition
Setting: Post-training alignment of Large Language Models (LLMs) for mathematical reasoning
Inputs: A pool of mathematical problems with ground truth answers
Outputs: A reasoning model policy capable of generating correct solution steps
Pipeline Flow
Data Prep: Knowledge Tagging & Clustering (Offline)
Iteration Start: Subset Probing → Error Analysis
Dynamic Sampling: Weighting by Similarity & Difficulty
Training: Iterative DPO Update
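The four-step flow above can be sketched as a loop; `probe`, `dynamic_sample`, and `dpo_update` below are toy stand-ins for the real model, sampler, and trainer, not the paper's implementation:

```python
# Minimal skeleton of the SAI-DPO online loop (probe -> sample -> train),
# with stub functions standing in for the real model and DPO trainer.

import random

def probe(model, subset):
    """Run the current model on a small probe subset; return failed items."""
    return [q for q in subset if random.random() > model["skill"]]

def dynamic_sample(pool, errors, n):
    """Up-weight pool items that share a knowledge cluster with a probe error."""
    error_clusters = {q["cluster"] for q in errors}
    weights = [3.0 if q["cluster"] in error_clusters else 1.0 for q in pool]
    return random.choices(pool, weights=weights, k=n)

def dpo_update(model, batch):
    """Stand-in for an iterative DPO step: pretend training improves skill."""
    model["skill"] = min(1.0, model["skill"] + 0.05)

random.seed(0)
pool = [{"cluster": i % 4} for i in range(200)]   # offline-clustered problem pool
model = {"skill": 0.5}
for it in range(3):                               # three SAI-DPO iterations
    errors = probe(model, pool[:20])              # probe a small representative subset
    batch = dynamic_sample(pool, errors, n=50)    # weight by current weaknesses
    dpo_update(model, batch)                      # iterative DPO update
```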
System Modules
Knowledge Tagger (Data Preparation)
Annotate problems with specific knowledge points (tags)
Model or implementation: DeepSeek-R1-Distill-Qwen-14B
Knowledge Clusterer (Data Preparation)
Group problems into domains based on knowledge point similarity
Model or implementation: Sentence-Transformers + K-Means
Subset Prober (Dynamic Sampling)
Assess current model competence on a representative subset
Model or implementation: Current Policy (Model being trained)
Dynamic Sampler (Dynamic Sampling)
Select training data from the full pool that matches current weaknesses
Model or implementation: Statistical Algorithm
DPO Trainer
Update model weights using preference optimization
Model or implementation: Current Policy (e.g., Qwen2.5-7B)
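The Knowledge Clusterer step (Sentence-Transformers embeddings + K-Means) can be illustrated with a dependency-free sketch; the hard-coded 2-D points below stand in for real tag embeddings:

```python
# Toy stand-in for the Knowledge Clusterer: group knowledge-point embeddings
# with k-means. The real pipeline uses Sentence-Transformers embeddings and
# scikit-learn-style K-Means; 2-D points are hard-coded here for illustration.

def kmeans(points, k, iters=20):
    centers = points[:k]                          # simple (non-random) init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute each center as the mean of its assigned points
        centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups: 'algebra-like' tags near (0,0), 'geometry-like' near (10,10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```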
Novel Architectural Elements
Feedback-driven data selection loop in which the 'Error Dataset' from each probe step dynamically reshapes the sampling distribution over the main training pool in real time
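One plausible way to turn per-cluster probe error rates into a sampling distribution is a softmax; this is an illustrative assumption, not the paper's stated weighting scheme:

```python
# Hypothetical sketch: convert per-cluster probe error rates into a sampling
# distribution that up-weights clusters where the model currently fails.

import math

def cluster_weights(error_rates, temperature=0.5):
    """Softmax over error rates; a lower temperature sharpens the focus on weaknesses."""
    exps = [math.exp(r / temperature) for r in error_rates]
    total = sum(exps)
    return [e / total for e in exps]

# Model fails 80% on geometry, 10% on algebra -> geometry dominates sampling.
w = cluster_weights([0.8, 0.1])
```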
Modeling
Base Model: Qwen2.5-7B-Math-Base, Qwen2.5-7B-Distill, Llama3.1-8B-Instruct
Training Method: Iterative Direct Preference Optimization (DPO)
Objective Functions:
Purpose: Maximize likelihood of preferred responses while staying close to reference model.
Positive sample: A randomly selected correct response generated by the model
Negative sample: A randomly selected incorrect response generated by the model
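On a single preference pair, the standard DPO loss (Rafailov et al., 2023) reduces to a scalar computation over summed log-probabilities under the policy and the frozen reference model; the log-prob values below are made up for illustration:

```python
# Numerical sketch of the standard DPO objective on one preference pair,
# given (summed) log-probs of the chosen/rejected answers under the policy
# and the frozen reference model. beta is the usual DPO temperature.

import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin))."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the correct answer more strongly than the
# reference does, the margin is positive and the loss falls below log(2):
loss = dpo_loss(pol_chosen=-10.0, pol_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
```

At initialization, policy and reference agree, the margin is zero, and the loss sits at log(2); training pushes the margin positive on each pair.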
Key Hyperparameters:
training_samples_per_iteration: 20000
subset_probe_size: 1% of data
DPO_beta: Not explicitly reported in the paper
Compute: Negligible overhead (<1% time for subset probing, <5% for initial tagging) compared to training time
Comparison to Prior Work
vs. S1/LIMO: SAI-DPO is dynamic and iterative, adapting to the model's changing state rather than using a static high-quality set
vs. KIMI K1.5: Adapts based on real-time 'self-aware' difficulty (P@K) rather than predefined curriculum schedules
Limitations
Relies on a stronger teacher model (DeepSeek-R1) for initial knowledge point annotation
Requires ground truth answers to verify correctness for P@K calculation (not applicable to open-ended tasks without verifiers)
Iterative process adds complexity compared to single-stage DPO
Reproducibility
Data selection methodology and formulas are provided. Base models (Qwen2.5, Llama3.1) and tagging model (DeepSeek-R1) are public. Code URL is not provided in the text. Exact hyperparameters (learning rate, epochs) are not detailed in the provided snippet.
📊 Experiments & Results
Evaluation Setup
Mathematical reasoning on competition-level and standard benchmarks
Benchmarks:
AIME24 (Competition Math)
AMC23 (Competition Math)
GSM8K (Grade School Math)
MATH (Competition Math)
Metrics:
Accuracy (Pass@1, implied by the reasoning-task context)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Illustration of varying difficulty discrimination across different models, motivating the need for self-aware difficulty metrics.
Main Takeaways
Dynamic data selection yields significant gains over static baselines: SAI-DPO achieves up to 21.3 percentage points average improvement across 8 benchmarks.
The method is particularly effective on hard competition datasets, showing +10 points on AIME24 and +15 points on AMC23.
Self-aware difficulty (P@K, step count) combined with knowledge point targeting effectively identifies the 'frontier' of the model's capabilities, preventing wasted training on mastered or impossible problems.
📚 Prerequisite Knowledge
Prerequisites
Direct Preference Optimization (DPO)
Reinforcement Learning (RL) concepts
Clustering algorithms (K-Means)
Pass@K evaluation metric
Key Terms
DPO: Direct Preference Optimization—an algorithm that fine-tunes language models to align with preferences by optimizing a classification loss on preference pairs
SAI-DPO: Self-Aware Iterative Direct Preference Optimization—the proposed method that dynamically samples data based on model-specific difficulty and error similarity
P@K: Pass at K—a metric measuring the probability that at least one of K generated solutions is correct; used here as a proxy for problem difficulty
Knowledge Points: Specific mathematical concepts or domains (e.g., 'geometry', 'sequences') involved in a problem, used to cluster similar questions
Self-aware Difficulty: A measure of difficulty derived from the model's own behavior (success rate, output length, number of steps) rather than human labels
SFT: Supervised Fine-Tuning—training the model on correct examples before RL/DPO alignment
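The P@K term above is usually computed with the unbiased estimator of Chen et al. (2021): given n sampled solutions of which c are correct, it is the probability that a random size-k subset contains at least one correct sample. The summary does not state which estimator the paper uses, but the standard one is:

```python
# Unbiased Pass@K estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
# where n = samples drawn, c = correct samples, k = budget.

from math import comb

def pass_at_k(n, c, k):
    if n - c < k:            # too few incorrect samples: every size-k subset passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 2 correct out of 10 samples: Pass@1 = 0.2, Pass@5 ~ 0.78.
p1 = pass_at_k(n=10, c=2, k=1)
p5 = pass_at_k(n=10, c=2, k=5)
```

In SAI-DPO this quantity doubles as a difficulty signal: problems with low Pass@K under the current policy sit on the frontier of its capability.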