NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

📝 Paper Summary

Math Reasoning · LLM Alignment · Online Learning

NFT enables language models to learn from their own errors via a supervised objective that implicitly models a negative policy, achieving performance comparable to reinforcement learning without explicit reward maximization.

Core Problem

Supervised Learning (SL) methods like Rejection Fine-Tuning (RFT) typically discard incorrect self-generated answers, preventing models from reflecting on mistakes, while Reinforcement Learning (RL) handles them but adds complexity.

Why it matters:

Throwing away negative data wastes valuable feedback signals that are critical for self-reflection and general intelligence
The prevailing view that 'self-improvement is exclusive to RL' creates an artificial divide between SL and RL methodologies
Existing SL methods hit a competence ceiling by reinforcing only what the model already knows (positive data) rather than correcting what it gets wrong

Concrete Example: In RFT, if a model generates 1 correct solution and 9 incorrect ones, it trains only on the 1 correct solution and ignores the 9 mistakes. NFT uses the 9 mistakes to explicitly push the model's probability distribution away from those errors.

Key Novelty

Negative-aware Fine-Tuning (NFT)

Constructs an 'implicit negative policy'—a mathematical re-parameterization of the target positive model—to model the distribution of incorrect answers
Derives a supervised-style loss function for negative data that directly optimizes the positive model by minimizing the likelihood of errors, weighted by their probability ratio
Demonstrates that this supervised approach is theoretically equivalent to the RL algorithm GRPO (Group Relative Policy Optimization) in strict on-policy settings

Architecture

The conceptual framework of NFT compared to RFT and RL. It illustrates how NFT uses the 'Implicit Negative Policy' to derive a learning signal from negative answers.

Evaluation Highlights

Learning from positive data (RFT) contributes ~80% of total performance gain in 32B models, while negative data (NFT) contributes the remaining ~20%
NFT matches or surpasses state-of-the-art RL algorithms like GRPO and DAPO across 7B and 32B model scales
Proves theoretical equivalence: NFT and GRPO loss gradients are identical when training is strictly on-policy (epsilon <= 1)

Breakthrough Assessment

8/10

Significantly bridges the gap between SL and RL, offering a simpler supervised framework that achieves RL-level performance and providing theoretical proof of their equivalence in on-policy settings.

⚙️ Technical Details

Problem Definition

Setting: Math reasoning where a policy (LLM) generates answers that are judged by a binary verifier

Inputs: Math question q

Outputs: Answer a

Pipeline Flow

Generator (produces K answers)
Verifier (scores answers)
Data Splitter (separates positive/negative)
Optimizer (updates model using NFT objective)
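
The first three stages above can be sketched as a single data-collection step. This is a minimal illustration rather than the paper's implementation: `collect_step`, `toy_generator`, and `toy_verifier` are hypothetical names, and the toy generator stands in for sampling K answers from the LLM.

```python
from typing import Callable, List, Tuple


def collect_step(
    question: str,
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], bool],
    k: int,
) -> Tuple[List[str], List[str], float]:
    """Generate k answers, verify them, and split them by correctness.

    Returns (positives, negatives, correctness_rate).
    """
    answers = generate(question, k)                     # Generator
    labels = [verify(question, a) for a in answers]     # Verifier
    # Data Splitter: keep both groups instead of discarding the mistakes.
    positives = [a for a, ok in zip(answers, labels) if ok]
    negatives = [a for a, ok in zip(answers, labels) if not ok]
    rate = len(positives) / k
    return positives, negatives, rate


# Hypothetical stand-ins for the LLM sampler and the binary verifier.
def toy_generator(question: str, k: int) -> List[str]:
    return ["4"] + ["5"] * (k - 1)  # one right answer, k - 1 wrong ones


def toy_verifier(question: str, answer: str) -> bool:
    return answer == "4"


positives, negatives, rate = collect_step(
    "2 + 2 = ?", toy_generator, toy_verifier, k=10
)
print(len(positives), len(negatives), rate)
```

In this toy run the split mirrors the 1-correct, 9-incorrect example: RFT would pass only `positives` to the optimizer, whereas NFT passes both lists together with the correctness rate.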

System Modules

Generator

Generate K candidate answers for a given question using the current policy

Model or implementation: Qwen2.5-Math-7B or Qwen2.5-32B

Verifier

Determine the correctness of each generated answer

Model or implementation: External binary verifier (rule-based or model-based)

Optimizer

Update model weights using both positive and negative samples

Model or implementation: The same model used as the Generator

Novel Architectural Elements

Dual-objective loss function combining standard maximum likelihood for positive data with a negative-likelihood ratio loss for negative data
Implicit negative policy parameterization allowing direct optimization on negative samples without a separate reward model

Modeling

Base Model: Qwen2.5-Math-7B and Qwen2.5-32B

Training Method: Negative-aware Fine-Tuning (NFT)

Objective Functions:

Purpose: Maximize likelihood of correct answers (standard SL).

Formally: L_pos = - log pi_theta(a|q)

Purpose: Minimize likelihood of incorrect answers weighted by their probability ratio (Implicit Negative Policy).

Formally: L_neg = - log ( (1 - r_q * R_theta) / (1 - r_q) ), where r_q is the correctness rate for question q and R_theta = pi_theta(a|q) / pi_old(a|q) is the likelihood ratio against the old policy
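
The two objectives can be written down directly from the formulas. The sketch below works on whole-answer log-probabilities as plain floats; a real implementation would more likely use token-level log-probabilities in a tensor library. The function names and the `floor` guard are assumptions of this sketch, not details from the paper.

```python
import math


def positive_loss(logp_new: float) -> float:
    """L_pos = -log pi_theta(a|q), given the answer's log-probability."""
    return -logp_new


def negative_loss(
    r_q: float, logp_new: float, logp_old: float, floor: float = 1e-6
) -> float:
    """L_neg = -log((1 - r_q * R_theta) / (1 - r_q)).

    R_theta = pi_theta(a|q) / pi_old(a|q). The `floor` guard is an
    assumption of this sketch that keeps the log argument positive.
    """
    if not 0.0 < r_q < 1.0:
        raise ValueError("r_q must lie strictly between 0 and 1")
    ratio = math.exp(logp_new - logp_old)
    numerator = max(1.0 - r_q * ratio, floor)
    return -math.log(numerator / (1.0 - r_q))


# On-policy (R_theta = 1): the negative loss is zero.
print(negative_loss(0.5, logp_new=-2.0, logp_old=-2.0))
# Lowering the probability of a wrong answer (R_theta < 1) lowers the loss.
print(negative_loss(0.5, logp_new=-3.0, logp_old=-2.0) < 0.0)
```

Minimizing `negative_loss` therefore pushes R_theta below 1 on incorrect answers, which is how the supervised objective moves probability mass away from mistakes.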

Adaptation: Full fine-tuning

Training Data:

DAPO-Math-17k dataset (17k math questions)
Data generated online: K answers per question per iteration

Key Hyperparameters:

batch_size: 512
training_steps: 5000
generation_temperature: 1.0
clip_epsilon: 1.0 (default)
prompt_weighting: Dependent on correctness rate r_q (e.g., sqrt((1-r)/r))
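
As a small illustration of the example weighting sqrt((1-r)/r), the sketch below computes the weight for a few correctness rates. The function name `prompt_weight` is hypothetical, and raising an error when r is 0 or 1 is a choice made for this sketch; how such questions are handled in training is not stated here.

```python
import math


def prompt_weight(r_q: float) -> float:
    """Weight sqrt((1 - r) / r) for a question with correctness rate r."""
    if not 0.0 < r_q < 1.0:
        raise ValueError("r_q must lie strictly between 0 and 1")
    return math.sqrt((1.0 - r_q) / r_q)


for r in (0.1, 0.5, 0.9):
    print(r, round(prompt_weight(r), 3))
```

The weights come out at roughly 3, 1, and 0.33, so a question the model answers correctly 10% of the time counts about nine times as much as one it answers correctly 90% of the time.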

Compute: Single model copy in memory (memory efficient)

Comparison to Prior Work

vs. RFT: NFT utilizes negative samples (mistakes) which RFT discards
vs. GRPO: NFT is supervised (no explicit advantage estimation needed), but theoretically equivalent in on-policy settings; handles off-policy clipping differently (soft decay vs hard clip)
vs. PPO [not cited in paper]: NFT avoids training a separate value network/critic, unlike standard PPO
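
The on-policy equivalence with GRPO can be checked by hand in a simplified, answer-level form. The derivation below is a sketch under three assumptions: the prompt weight is omega(q) = sqrt((1 - r_q) / r_q), GRPO normalizes binary rewards by the group mean r_q and the group standard deviation sqrt(r_q (1 - r_q)), and the GRPO loss gradient for a sample with advantage A is -A times the gradient of R_theta.

```latex
% On-policy: R_\theta = 1 at the current parameters,
% so \nabla R_\theta = \nabla \log \pi_\theta(a \mid q).
\begin{aligned}
\nabla L_{\text{pos}} &= -\nabla \log \pi_\theta(a \mid q) = -\nabla R_\theta, \\
\nabla L_{\text{neg}} &= \frac{r_q}{1 - r_q R_\theta}\,\nabla R_\theta
                       = \frac{r_q}{1 - r_q}\,\nabla R_\theta, \\
\omega(q)\,\nabla L_{\text{pos}} &= -\sqrt{\tfrac{1 - r_q}{r_q}}\;\nabla R_\theta, \qquad
\omega(q)\,\nabla L_{\text{neg}} = \sqrt{\tfrac{r_q}{1 - r_q}}\;\nabla R_\theta, \\
A^{+} &= \frac{1 - r_q}{\sqrt{r_q (1 - r_q)}} = \sqrt{\tfrac{1 - r_q}{r_q}}, \qquad
A^{-} = \frac{0 - r_q}{\sqrt{r_q (1 - r_q)}} = -\sqrt{\tfrac{r_q}{1 - r_q}}.
\end{aligned}
```

Under these assumptions the weighted NFT gradients equal -A^{+} and -A^{-} times the gradient of R_theta, which is the GRPO gradient for correct and incorrect answers respectively. Once R_theta drifts away from 1, the factor r_q / (1 - r_q R_theta) is no longer constant, which is where the two methods start to differ.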

Limitations

Relies on the availability of a binary verifier or ground truth answers
Performance gain over RFT is smaller in smaller models (7B) compared to larger models (32B)
Strict equivalence to GRPO only holds in on-policy settings; off-policy behavior diverges

Reproducibility

Code availability is not explicitly stated in the text. Training uses the public DAPO-Math-17k dataset. The method relies on online generation and verification, which is standard in recent work on LLM reasoning.

📊 Experiments & Results

Evaluation Setup

Online fine-tuning on math problems with no external teacher (self-generated data)

Benchmarks:

AIME 2024 (Math Competition)
AIME 2025 (Math Competition)
AMC 2023 (Math Competition)
MATH500 (Math Problem Solving)
OlympiadBench (Olympiad Math)
Minerva Math (Math Reasoning)

Metrics:

Average Accuracy across benchmarks
Statistical methodology: Not explicitly reported in the paper

Key Results

Setting	Metric	Positive data (RFT)	Negative data (added by NFT)
32B model, benchmark average	Share of total performance gain	~80%	~20%

Experiment Figures

Entropy analysis of generated distributions over training steps for RFT vs NFT.

Main Takeaways

NFT consistently outperforms RFT, demonstrating that negative feedback (mistakes) contains valuable signal ignored by standard supervised learning.
The performance gap between RFT and RL is largely due to SL's neglect of negative data; bridging this gap allows SL to match RL methods like GRPO.
Negative feedback becomes increasingly important for larger models (32B vs 7B), suggesting that as models improve at memorization, reflection on errors becomes the new bottleneck.
RFT remains a very strong baseline, accounting for ~80% of the possible gains in the tested configurations.
Prioritizing harder questions (lower correctness rate) via weighting enhances performance.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)
Reinforcement Learning basics (Policy Gradient, PPO)
Concept of on-policy vs off-policy learning

Key Terms

RFT: Rejection Fine-Tuning—an SL method that generates multiple answers, filters for correctness, and fine-tunes the model only on the correct ones

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs for the same input

NFT: Negative-aware Fine-Tuning—the proposed supervised method that learns from both positive and negative self-generated data

implicit negative policy: A theoretical construct representing the distribution of incorrect answers, parameterized using the positive target model and the old (data-generating) policy

on-policy: Training setting where the data is collected using the current version of the model being optimized

importance sampling: A technique to estimate properties of a target distribution using samples from a different (old) distribution

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization, a recent RL algorithm for LLM reasoning