Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Lifan Yuan, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
NVIDIA
arXiv (2025)
RL, Reasoning
📝 Paper Summary
Math Reasoning, LLM Alignment, Online Learning
NFT enables language models to learn from their own errors via a supervised objective that implicitly models a negative policy, achieving performance comparable to reinforcement learning without explicit reward maximization.
Core Problem
Supervised Learning (SL) methods like Rejection Fine-Tuning (RFT) typically discard incorrect self-generated answers, preventing models from reflecting on mistakes, while Reinforcement Learning (RL) handles them but adds complexity.
Why it matters:
Throwing away negative data wastes valuable feedback signals that are critical for self-reflection and general intelligence
The prevailing view that 'self-improvement is exclusive to RL' creates an artificial divide between SL and RL methodologies
Existing SL methods hit a competence ceiling by reinforcing only what the model already knows (positive data) rather than correcting what it gets wrong
Concrete Example: In RFT, if a model generates 1 correct solution and 9 incorrect ones, it trains only on the 1 correct solution and ignores the 9 mistakes. NFT uses the 9 mistakes to explicitly push the model's probability distribution away from those errors.
Key Novelty
Negative-aware Fine-Tuning (NFT)
Constructs an 'implicit negative policy'—a mathematical re-parameterization of the target positive model—to model the distribution of incorrect answers
Derives a supervised-style loss function for negative data that directly optimizes the positive model by minimizing the likelihood of errors, weighted by their probability ratio
Demonstrates that this supervised approach is theoretically equivalent to the RL algorithm GRPO (Group Relative Policy Optimization) in strict on-policy settings
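The implicit negative policy can be illustrated on a toy discrete distribution. The sketch below assumes the old policy is split by a verifier into a correct and an incorrect subset, so that it is a mixture of a positive and a negative policy with weight r_q; the probabilities and labels are made-up numbers, and the variable names are illustrative rather than taken from the paper.

```python
import numpy as np

# Toy old policy over five candidate answers to one question (made-up numbers).
pi_old = np.array([0.30, 0.10, 0.25, 0.20, 0.15])
# Hypothetical verifier labels: True marks a correct answer.
correct = np.array([True, False, False, True, False])

# Correctness rate r_q: probability mass the old policy places on correct answers.
r_q = pi_old[correct].sum()

# Positive / negative policies: the old policy restricted to each subset, renormalized.
pi_pos = np.where(correct, pi_old, 0.0) / r_q
pi_neg = np.where(~correct, pi_old, 0.0) / (1.0 - r_q)

# The old policy is a mixture of the two ...
mixture = r_q * pi_pos + (1.0 - r_q) * pi_neg
# ... so the negative policy can be expressed through the positive one alone.
pi_neg_implicit = (pi_old - r_q * pi_pos) / (1.0 - r_q)

print(np.allclose(mixture, pi_old), np.allclose(pi_neg_implicit, pi_neg))
```

Because the negative policy is a function of the positive one, fitting it to the incorrect answers moves the positive model's parameters directly, which is the re-parameterization the bullets above describe.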
Architecture
The conceptual framework of NFT compared to RFT and RL. It illustrates how NFT uses the 'Implicit Negative Policy' to derive a learning signal from negative answers.
Evaluation Highlights
Learning from positive data (RFT) contributes ~80% of total performance gain in 32B models, while negative data (NFT) contributes the remaining ~20%
NFT matches or surpasses state-of-the-art RL algorithms like GRPO and DAPO across 7B and 32B model scales
Proves theoretical equivalence: NFT and GRPO loss gradients are identical when training is strictly on-policy (epsilon <= 1)
Breakthrough Assessment
8/10
Significantly bridges the gap between SL and RL, offering a simpler supervised framework that achieves RL-level performance and providing theoretical proof of their equivalence in on-policy settings.
⚙️ Technical Details
Problem Definition
Setting: Math reasoning where a policy (LLM) generates answers that are judged by a binary verifier
Inputs: Math question q
Outputs: Answer a
Pipeline Flow
Generator (produces K answers)
Verifier (scores answers)
Data Splitter (separates positive/negative)
Optimizer (updates model using NFT objective)
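The first three pipeline stages can be sketched with toy stand-ins. Everything below is hypothetical: `generate` fakes a policy by guessing a sum with occasional off-by-one errors, and `verify` is a rule-based check against ground truth. A real setup would sample from the LLM and call the actual verifier.

```python
import random

def generate(question, k, rng):
    """Hypothetical stand-in for the policy: returns k guesses for a + b."""
    a, b = question
    return [a + b + rng.choice([0, 0, 1, -1]) for _ in range(k)]

def verify(question, answer):
    """Binary verifier: 1 if the answer matches the ground truth, else 0."""
    a, b = question
    return int(answer == a + b)

def split_by_correctness(question, answers):
    """Data splitter: separate verified-correct answers from incorrect ones."""
    positives = [ans for ans in answers if verify(question, ans)]
    negatives = [ans for ans in answers if not verify(question, ans)]
    return positives, negatives

rng = random.Random(0)
question = (17, 25)
answers = generate(question, k=8, rng=rng)
positives, negatives = split_by_correctness(question, answers)

# Empirical correctness rate for this question, used later by the objective.
r_q = len(positives) / len(answers)
print(positives, negatives, r_q)
```

The optimizer stage then consumes both lists together with r_q, rather than only the positives as RFT would.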
System Modules
Generator
Generate K candidate answers for a given question using the current policy
Model or implementation: Qwen2.5-Math-7B or Qwen2.5-32B
Verifier
Determine the correctness of each generated answer
Model or implementation: External binary verifier (rule-based or model-based)
Optimizer
Update model weights using both positive and negative samples
Model or implementation: Same model as the Generator
Novel Architectural Elements
Dual-objective loss function combining standard maximum likelihood for positive data with a negative-likelihood ratio loss for negative data
Implicit negative policy parameterization allowing direct optimization on negative samples without a separate reward model
Modeling
Base Model: Qwen2.5-Math-7B and Qwen2.5-32B
Training Method: Negative-aware Fine-Tuning (NFT)
Objective Functions:
Purpose: Maximize likelihood of correct answers (standard SL).
Formally: L_pos = - log pi_theta(a|q)
Purpose: Minimize likelihood of incorrect answers weighted by their probability ratio (Implicit Negative Policy).
Formally: L_neg = - log ( (1 - r_q * R_theta) / (1 - r_q) ), where R_theta = pi_theta(a|q) / pi_old(a|q) is the likelihood ratio against the old policy and r_q is the correctness rate for question q
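The two objectives above can be written as a small per-answer loss function. This is a minimal NumPy sketch, assuming sequence-level log-likelihoods and 0 < r_q < 1; the function name, the `eps` floor on the numerator, and the example numbers are illustrative choices, not the paper's clipping rule or implementation.

```python
import numpy as np

def nft_losses(logp_new, logp_old, is_correct, r_q, eps=1e-6):
    """Per-answer loss for one question, following L_pos and L_neg above.

    logp_new / logp_old: log-likelihood of each answer under the current and
    old policy. r_q: correctness rate of the question, assumed in (0, 1).
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)

    ratio = np.exp(logp_new - logp_old)  # R_theta
    loss_pos = -logp_new                 # L_pos = -log pi_theta(a|q)

    # Illustrative safeguard (an assumption): keep the log argument positive.
    numer = np.maximum(1.0 - r_q * ratio, eps)
    loss_neg = -np.log(numer / (1.0 - r_q))  # L_neg

    return np.where(is_correct, loss_pos, loss_neg)

# On-policy: R_theta = 1, so the negative term evaluates to zero.
on_policy = nft_losses([-1.0, -2.0], [-1.0, -2.0], [True, False], r_q=0.5)
# The wrong answer became more likely: its loss rises above zero.
drifted = nft_losses([-1.0, -1.5], [-1.0, -2.0], [True, False], r_q=0.5)
print(on_policy, drifted)
```

Although the negative term is zero on-policy, its gradient is not: the loss increases as the wrong answer becomes more likely, so minimizing it pushes probability away from errors.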
Adaptation: Full fine-tuning
Training Data:
DAPO-Math-17k dataset (17k math questions)
Data generated online: K answers per question per iteration
Prompt weighting: depends on the correctness rate r_q (e.g., sqrt((1 - r_q) / r_q))
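The example weighting above favors questions the model rarely gets right. A small sketch follows; the function name is hypothetical, and rejecting r_q values of 0 or 1 (where the weight is undefined or zero) is an assumption of this sketch, not a documented rule.

```python
import math

def prompt_weight(r_q):
    """Weight sqrt((1 - r_q) / r_q) for a question with correctness rate r_q."""
    if not 0.0 < r_q < 1.0:
        raise ValueError("r_q must be strictly between 0 and 1 in this sketch")
    return math.sqrt((1.0 - r_q) / r_q)

# Hard (r_q = 0.1), medium (0.5), and easy (0.9) questions.
weights = {r: prompt_weight(r) for r in (0.1, 0.5, 0.9)}
print(weights)  # roughly 3, 1, and 1/3
```

A question answered correctly 10% of the time therefore receives about nine times the weight of one answered correctly 90% of the time.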
Compute: Single model copy in memory (memory efficient)
Comparison to Prior Work
vs. RFT: NFT utilizes negative samples (mistakes) which RFT discards
vs. GRPO: NFT is supervised (no explicit advantage estimation needed), but theoretically equivalent in on-policy settings; handles off-policy clipping differently (soft decay vs hard clip)
vs. PPO [not cited in paper]: NFT avoids training a separate value network/critic, unlike standard PPO
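The on-policy equivalence with GRPO can be checked by hand from the L_pos and L_neg objectives. The derivation below is a sketch under three assumptions: the prompt weight is ω(q) = sqrt((1 - r_q) / r_q), rewards are binary, and GRPO advantages are standardized by the group mean and (population) standard deviation. The symbols ω, A⁺ and A⁻ are notation introduced here for illustration.

```latex
% Strictly on-policy: \pi_\theta = \pi_{\mathrm{old}}, hence R_\theta = 1 and
% \nabla_\theta R_\theta = \nabla_\theta \log \pi_\theta(a \mid q).
\begin{align*}
\nabla_\theta L_{\mathrm{pos}} &= -\nabla_\theta \log \pi_\theta(a \mid q), \\
\nabla_\theta L_{\mathrm{neg}}
  &= \left.\frac{r_q \, \nabla_\theta R_\theta}{1 - r_q R_\theta}\right|_{R_\theta = 1}
   = \frac{r_q}{1 - r_q}\, \nabla_\theta \log \pi_\theta(a \mid q).
\end{align*}

% Multiply both by the prompt weight \omega(q) = \sqrt{(1 - r_q)/r_q}:
\begin{align*}
\omega(q)\, \nabla_\theta L_{\mathrm{pos}} &= -\sqrt{\frac{1 - r_q}{r_q}}\; \nabla_\theta \log \pi_\theta(a \mid q), \\
\omega(q)\, \nabla_\theta L_{\mathrm{neg}} &= +\sqrt{\frac{r_q}{1 - r_q}}\; \nabla_\theta \log \pi_\theta(a \mid q).
\end{align*}

% GRPO with binary rewards: group mean r_q, standard deviation \sqrt{r_q (1 - r_q)}.
\begin{align*}
A^{+} &= \frac{1 - r_q}{\sqrt{r_q (1 - r_q)}} = \sqrt{\frac{1 - r_q}{r_q}}, &
A^{-} &= \frac{0 - r_q}{\sqrt{r_q (1 - r_q)}} = -\sqrt{\frac{r_q}{1 - r_q}}.
\end{align*}

% The on-policy policy-gradient loss has gradient -A \, \nabla_\theta \log \pi_\theta(a \mid q),
% which matches the two weighted NFT gradients term by term.
```

Off-policy, R_θ is no longer 1 and the two methods treat the ratio differently, which is where the soft-decay versus hard-clip distinction noted above comes in.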
Limitations
Relies on the availability of a binary verifier or ground truth answers
Performance gain over RFT is smaller in smaller models (7B) compared to larger models (32B)
Strict equivalence to GRPO only holds in on-policy settings; off-policy behavior diverges
Reproducibility
Code availability is not explicitly stated in the text. Training uses the public DAPO-Math-17k dataset. The method relies on online generation and verification, which is standard in recent work on LLM reasoning.
📊 Experiments & Results
Evaluation Setup
Online fine-tuning on math problems with no external teacher (self-generated data)
Benchmarks:
AIME 2024 (Math Competition)
AIME 2025 (Math Competition)
AMC 2023 (Math Competition)
MATH500 (Math Problem Solving)
OlympiadBench (Olympiad Math)
Minerva Math (Math Reasoning)
Metrics:
Average Accuracy across benchmarks
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Average (32B Model Settings) | Percentage of total gain | 100 | 20 | -80
Experiment Figures
Entropy analysis of generated distributions over training steps for RFT vs NFT.
Main Takeaways
NFT consistently outperforms RFT, demonstrating that negative feedback (mistakes) contains valuable signal ignored by standard supervised learning.
The performance gap between RFT and RL is largely due to SL's neglect of negative data; bridging this gap allows SL to match RL methods like GRPO.
Negative feedback becomes increasingly important for larger models (32B vs 7B), suggesting that as models improve at memorization, reflection on errors becomes the new bottleneck.
RFT remains a very strong baseline, accounting for ~80% of the possible gains in the tested configurations.
Prioritizing harder questions (lower correctness rate) via weighting enhances performance.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)
RFT: Rejection Fine-Tuning—an SL method that generates multiple answers, filters for correctness, and fine-tunes the model only on the correct ones
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs for the same input
NFT: Negative-aware Fine-Tuning—the proposed supervised method that learns from both positive and negative self-generated data
implicit negative policy: A theoretical construct representing the distribution of incorrect answers, parameterized using the positive target model and the old (data-generating) policy
on-policy: Training setting where the data is collected using the current version of the model being optimized
importance sampling: A technique to estimate properties of a target distribution using samples from a different (old) distribution
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization, a recent RL algorithm for LLM reasoning
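The group normalization in the GRPO entry can be made concrete with a few lines of NumPy. This is a sketch assuming binary rewards and the population standard deviation with no stabilizing epsilon; actual GRPO implementations may differ on both points, and the function name is illustrative.

```python
import numpy as np

def group_normalized_advantages(rewards):
    """Standardize rewards within one group of answers to the same question."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()  # population standard deviation (ddof=0)
    if std == 0.0:
        # All answers received the same reward: no relative signal in this sketch.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# One correct answer out of four, so the correctness rate is 0.25.
adv = group_normalized_advantages([1, 0, 0, 0])
print(adv)  # correct answer: sqrt(3); incorrect answers: -1/sqrt(3)
```

With binary rewards the advantage of a correct answer reduces to sqrt((1 - r_q) / r_q), the same expression that appears as the example prompt weighting.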