Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

📝 Paper Summary

Reinforcement Learning for Reasoning · Reward Modeling

S-GRPO stabilizes reasoning-model training by reweighting advantage signals according to group balance, mitigating the impact of false-positive rewards in which a correct answer stems from flawed reasoning.

Core Problem

Standard GRPO assumes correct answers imply correct reasoning, but 'Think-Answer Mismatch' errors (correct answer, wrong logic) introduce reward noise that disproportionately corrupts gradients in unbalanced groups.

Why it matters:

In highly unbalanced groups (e.g., 1 correct out of 8), a single false positive mismatch can inflate the advantage signal by up to 60%, severely distorting the learning process.
Standard methods like GRPO collapse entirely under moderate noise levels (e.g., 20% mismatch rate), preventing models from learning robust reasoning patterns.
Existing heuristics like Dr. GRPO address variance but lack explicit noise modeling, failing to filter out high-risk signals from rare stochastic successes.

Concrete Example: When solving a math problem, a model might hallucinate gibberish like 'p(n) = n/(n^2-1)' that accidentally evaluates to the correct answer '8/63'. In a group where this is the only 'correct' response, standard GRPO assigns it a massive positive advantage, reinforcing the hallucination. S-GRPO identifies the group as unreliable and gates the update.
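To make the "massive positive advantage" concrete, here is a minimal sketch of the standard GRPO advantage (reward minus group mean, divided by group standard deviation) on binary rewards. The function name `grpo_advantages` and the use of the population standard deviation are illustrative assumptions, not taken from the paper's code.

```python
import math

def grpo_advantages(rewards):
    """Standard GRPO advantage: (r - mean) / std within one group.

    Assumes the population standard deviation. Returns zeros for a
    degenerate group (all rewards equal), where no signal exists.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

unbalanced = grpo_advantages([1, 0, 0, 0, 0, 0, 0, 0])  # 1 correct out of 8
balanced = grpo_advantages([1, 1, 1, 1, 0, 0, 0, 0])    # 4 correct out of 8

print(round(unbalanced[0], 3))  # advantage of the lone "correct" response
print(round(balanced[0], 3))    # advantage of a correct response in a balanced group
```

Under these assumptions the lone success receives an advantage of about 2.65 (the square root of 7), against 1.0 for a correct response in the balanced group. If that lone success is a think-answer mismatch, the largest signal in the group is spurious.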

Key Novelty

Stable Group-Relative Policy Optimization (S-GRPO)

Models the Think-Answer Mismatch as symmetric label noise and derives a closed-form optimal weight that minimizes the error between observed and true advantages.
Introduces a 'Noise-Gating Mechanism' that automatically down-weights or zeros out signals from highly unbalanced groups where the likelihood of noise corruption is highest.
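The following derivation sketch shows where such a closed-form weight can come from. It assumes binary true rewards t, observed rewards r that flip with probability p < 1/2 independently of everything else, and advantages standardized by their own mean and standard deviation. These modeling details are assumptions chosen to be consistent with the weight formula given under Objective Functions. They are not a reproduction of the paper's proof.

```latex
\begin{aligned}
A_{\text{true}} &= \frac{t-\mu_t}{\sigma_t}, \qquad
A_{\text{obs}} = \frac{r-\mu_r}{\sigma_r}, \qquad
\mathbb{E}[r \mid t] = p + (1-2p)\,t, \\[4pt]
w^{*} &= \arg\min_{w}\ \mathbb{E}\!\left[(w A_{\text{obs}} - A_{\text{true}})^2\right]
      = \frac{\mathbb{E}[A_{\text{obs}} A_{\text{true}}]}{\mathbb{E}[A_{\text{obs}}^2]}
      = \frac{\operatorname{Cov}(r,t)}{\sigma_r\,\sigma_t}
      = \frac{(1-2p)\,\sigma_t^{2}}{\sigma_r\,\sigma_t}
      = (1-2p)\,\frac{\sigma_t}{\sigma_r}.
\end{aligned}
```

Under the same assumptions, \mu_r = p + (1-2p)\mu_t, so an observed success rate below p or above 1-p corresponds to no valid \mu_t in [0, 1]. That is one way to motivate gating such groups to zero.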

Architecture

The shape of the optimal advantage weight w* as a function of the number of correct responses k in a group of size N=16, for different noise levels p.
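The curve described in this caption can be approximated with a short sketch. It assumes Bernoulli rewards, symmetric flip noise with p < 0.5, an implied true success rate obtained by inverting mu_r = p + (1 - 2p) * mu_t, and a gate that returns zero whenever that inversion leaves the interval (0, 1). The function name `sgrpo_weight` and these details are a reading of the summary, not the authors' implementation.

```python
import math

def sgrpo_weight(k, n, p):
    """Noise-aware weight for a group with k observed-correct out of n.

    Assumption: w* = (1 - 2p) * sigma_t / sigma_r with Bernoulli
    standard deviations, and a hard gate when the observed rate is
    explainable by noise alone.
    """
    mu_r = k / n                           # observed success rate
    if mu_r <= 0.0 or mu_r >= 1.0:
        return 0.0                         # degenerate group: no contrast to learn from
    mu_t = (mu_r - p) / (1.0 - 2.0 * p)    # implied true success rate
    if mu_t <= 0.0 or mu_t >= 1.0:
        return 0.0                         # gate: group too unbalanced for this noise level
    sigma_r = math.sqrt(mu_r * (1.0 - mu_r))
    sigma_t = math.sqrt(mu_t * (1.0 - mu_t))
    return (1.0 - 2.0 * p) * sigma_t / sigma_r

N = 16
for p in (0.0, 0.10, 0.15):
    weights = [round(sgrpo_weight(k, N, p), 3) for k in range(N + 1)]
    print(f"p={p:.2f}", weights)
```

In this sketch the weight peaks at 1 - 2p for a perfectly balanced group and falls to zero toward the extremes. With N=16 and p=0.10, a group with a single observed success (1/16 < p) is gated out entirely.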

Evaluation Highlights

+2.5% average accuracy gain on Qwen-Math-7B-Base across four math benchmarks (AMC, MATH500, Minerva, OlympiadBench) compared to the strong baseline Dr. GRPO.
Maintains stable learning progress under 20% synthetic reward noise, whereas standard GRPO suffers complete performance collapse (zero improvement).
+2.4% average accuracy gain on Qwen-Math-1.5B-Instruct compared to Dr. GRPO, demonstrating effectiveness across different model scales.

Breakthrough Assessment

8/10

Addresses a fundamental flaw in outcome-based reasoning rewards (Think-Answer Mismatch) with a theoretically principled and empirically effective reweighting scheme. Robustness results are particularly strong.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning for Mathematical Reasoning using Group-Relative Policy Optimization

Inputs: Natural language query q (math problem)

Outputs: Reasoning chain and final answer o

Pipeline Flow

Response Generation (Actor generates N samples)
Reward Evaluation (Binary correctness check)
S-GRPO Advantage Calculation (Reweighting based on noise model)
Policy Update (PPO)

System Modules

Actor LLM

Generate reasoning traces and answers for queries

Model or implementation: Qwen2.5-Math-7B-Base / Qwen2.5-Math-1.5B-Instruct / Llama-3.2-3B-Base

Reward Oracle

Check if final answers match ground truth

Model or implementation: Rule-based checker
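A rule-based checker of this kind might look like the sketch below. The `\boxed{...}` answer convention, the function names, and the fraction-based comparison are all hypothetical choices for illustration; the summary does not describe the actual checker.

```python
import re
from fractions import Fraction

def extract_final_answer(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def _as_fraction(text):
    """Parse strings such as '8/63', '0.5', or '3'; return None if not numeric."""
    try:
        return Fraction(text.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None

def binary_reward(response, ground_truth):
    """1.0 if the extracted final answer matches the ground truth, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    a, g = _as_fraction(answer), _as_fraction(ground_truth)
    if a is not None and g is not None:
        return 1.0 if a == g else 0.0      # numeric comparison, so 16/126 equals 8/63
    return 1.0 if answer == ground_truth.strip() else 0.0

# The checker never inspects the reasoning, only the final answer.
flawed = "Guess p(n) = n/(n^2-1), so the answer is \\boxed{8/63}"
print(binary_reward(flawed, "8/63"))
```

Because only the final answer is inspected, a response with nonsensical reasoning still earns the full reward, which is exactly the think-answer mismatch the method targets.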

Modeling

Base Model: Qwen2.5-Math-7B-Base

Training Method: Stable Group-Relative Policy Optimization (S-GRPO)

Objective Functions:

Purpose: Maximize expected reward while staying close to the old policy, using noise-corrected advantages.

Formally: L(θ) = E[min(r_t(θ) A_i, clip(r_t(θ), 1-ε, 1+ε) A_i)], where A_i = w* · (standard GRPO advantage).

Purpose: Minimize the error between the observed and true advantages.

Formally: w* = (1-2p) · (σ_t / σ_r), the value of w that minimizes E[(w A_obs - A_true)^2].
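The first objective can be sketched in a few lines. This is a simplified illustration: it uses one probability ratio per response rather than per token, takes the weight as a given input, and uses ε = 0.2 as an assumed default that the summary does not specify.

```python
def clipped_surrogate(ratios, advantages, weight, eps=0.2):
    """Mean PPO-style clipped objective with advantages scaled by `weight`.

    ratios:     new-policy / old-policy probability ratios, one per response
    advantages: standard GRPO advantages for the same responses
    weight:     scalar noise-aware weight for the whole group
    eps:        clipping range (assumed value)
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        a_w = weight * a                                  # noise-corrected advantage
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)     # clip(r, 1-eps, 1+eps)
        total += min(r * a_w, r_clipped * a_w)            # pessimistic bound
    return total / len(ratios)

# A gated group (weight 0) contributes nothing to the objective.
print(clipped_surrogate([1.5, 0.5], [1.0, -1.0], weight=0.0))
print(clipped_surrogate([1.5, 0.5], [1.0, -1.0], weight=0.8))
```

Scaling the advantage rather than the loss keeps the clipping behaviour unchanged while shrinking the gradient contribution of unreliable groups.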

Training Data:

8,500 problems sampled from MATH dataset (difficulty levels 3-5)

Key Hyperparameters:

group_size_N: 8
noise_probability_p: 0.10 or 0.15 (depending on model)
training_steps: Up to 500
evaluation_interval: 16 steps

Compute: Not reported in the paper

Comparison to Prior Work

vs. GRPO: S-GRPO adds a scalar weight w* to the advantage term to account for label noise, whereas GRPO treats all rewards as ground truth.
vs. Dr. GRPO: Dr. GRPO heuristically upweights balanced groups by removing the standard-deviation denominator from the advantage; S-GRPO derives an optimal weight from a noise model, which includes explicit gating for low-confidence groups.
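The effect of dropping the standard-deviation denominator can be checked numerically. The sketch below compares the total absolute advantage that an unbalanced group (1 of 8 correct) and a balanced group (4 of 8 correct) contribute under each scheme. The helper names and the "sum of absolute advantages" proxy are illustrative assumptions, and only the standard-deviation change is modeled.

```python
import math

def group_stats(rewards):
    """Mean and population standard deviation of one group's rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return mean, std

def grpo_adv(rewards):
    """GRPO: centered and divided by the group standard deviation."""
    mean, std = group_stats(rewards)
    return [0.0 if std == 0 else (r - mean) / std for r in rewards]

def dr_grpo_adv(rewards):
    """Dr. GRPO (as described here): centered only, no std division."""
    mean, _ = group_stats(rewards)
    return [r - mean for r in rewards]

def signal_magnitude(advantages):
    """Sum of absolute advantages: a rough proxy for the group's push on the policy."""
    return sum(abs(a) for a in advantages)

unbalanced = [1, 0, 0, 0, 0, 0, 0, 0]
balanced = [1, 1, 1, 1, 0, 0, 0, 0]

grpo_ratio = signal_magnitude(grpo_adv(unbalanced)) / signal_magnitude(grpo_adv(balanced))
dr_ratio = signal_magnitude(dr_grpo_adv(unbalanced)) / signal_magnitude(dr_grpo_adv(balanced))
print(round(grpo_ratio, 3), round(dr_ratio, 3))
```

By this proxy the unbalanced group carries relatively less weight under Dr. GRPO than under GRPO, but its weight never reaches zero, which is the gap the noise-gated weight is meant to close.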

Limitations

Requires estimating the noise parameter p, which is a hyperparameter rather than learned.
Optimal p value is model-dependent (lower for stronger models, higher for weaker ones).
Experiments limited to math reasoning tasks; generalizability to other domains (e.g., code) not tested.
Training limited to 300-500 steps due to computational constraints.

Reproducibility

Code: https://github.com/shenpeijun0212/S-GRPO

Code and data are publicly available on GitHub. Hyperparameters (p values, group size) are specified. Base models are open-weight.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on standard benchmarks, measured by Pass@1 accuracy with greedy decoding

Benchmarks:

AMC (Math Competition Problems)
MATH500 (Challenging Math Problems)
Minerva (Math Reasoning)
OlympiadBench (Olympiad-level Math)

Metrics:

Pass@1 Accuracy
Statistical methodology: Average of top-3 checkpoint performances evaluated every 16 steps
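The reporting rule can be sketched as follows. The function name is hypothetical, the example scores are made up for illustration, and whether the top-3 selection is done per benchmark or on the benchmark average is not stated in this summary.

```python
def top_k_checkpoint_average(accuracies, k=3):
    """Average of the k highest accuracies among the evaluated checkpoints."""
    if len(accuracies) < k:
        raise ValueError("need at least k checkpoint scores")
    best = sorted(accuracies, reverse=True)[:k]
    return sum(best) / k

# Hypothetical accuracies at steps 16, 32, 48, 64, 80 (not from the paper).
scores = [41.0, 47.5, 52.0, 55.0, 54.0]
reported = top_k_checkpoint_average(scores)
print(round(reported, 2))
```

Averaging the best few checkpoints smooths out step-to-step evaluation noise, though it also reports a slightly optimistic number compared with the final checkpoint alone.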

Key Results

Comparison against the Dr. GRPO baseline across different base models, showing consistent improvements.

Model	Benchmark	Metric	Baseline (Dr. GRPO)	This Paper (S-GRPO)	Δ
Qwen-Math-7B-Base	Average (AMC, MATH500, Minerva, OlympiadBench)	Pass@1 Accuracy	53.5	56.0	+2.5
Qwen-Math-1.5B-Instruct	Average (AMC, MATH500, Minerva, OlympiadBench)	Pass@1 Accuracy	47.3	49.7	+2.4

Experiment Figures

Training curves (Accuracy vs Steps) under 10% and 20% synthetic reward noise.

Training dynamics (Accuracy vs Steps) for different noise prior values (p=0, 0.10, 0.15).

Main Takeaways

S-GRPO consistently outperforms GRPO and Dr. GRPO across multiple model scales (1.5B to 7B) and types (Base vs Instruct).
The method exhibits high robustness to reward noise; at 20% synthetic noise, standard GRPO collapses while S-GRPO maintains learning.
There is a trade-off between learning speed and stability controlled by the noise parameter p: higher p leads to slower initial starts (due to gating) but more stable, monotonic convergence.
S-GRPO promotes longer chain-of-thought responses (13-30% increase in length) and higher rates of self-reflection compared to baselines.
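The synthetic-noise setting in the robustness takeaway can be imitated with a simple reward-flipping routine. This sketch assumes the noise is symmetric (each binary reward flips independently with probability p), matching the noise model named in this summary. The paper's exact injection procedure is not described here, so treat the details as assumptions.

```python
import random

def flip_rewards(rewards, p, rng):
    """Flip each binary reward independently with probability p."""
    return [1 - r if rng.random() < p else r for r in rewards]

rng = random.Random(0)            # fixed seed for reproducibility
clean = [1, 0, 0, 0, 0, 0, 0, 0]  # one truly correct response out of 8
noisy = flip_rewards(clean, 0.20, rng)
print(noisy)

# Empirical flip rate over many draws should be close to p.
trials = 20000
flips = sum(flip_rewards([0], 0.20, rng)[0] for _ in range(trials))
print(flips / trials)
```

In a group with few true successes, most flips turn a wrong response into an apparent success, which is why unbalanced groups are the ones most exposed to this kind of noise.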

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Group-Relative Policy Optimization (GRPO)
Symmetric Label Noise

Key Terms

GRPO: Group-Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a response's reward to the group average, eliminating the need for a critic model.

Think-Answer Mismatch: A phenomenon where a model generates a correct final answer despite using flawed or nonsensical reasoning steps.

S-GRPO: Stable Group-Relative Policy Optimization—the proposed method that reweights GRPO advantages to account for potential reward noise.

Symmetric Label Noise: A noise model where the observed label (reward) flips from the true label with a fixed probability p.

Dr. GRPO: "GRPO Done Right", a baseline variant of GRPO that removes the standard-deviation normalization of advantages.

PPO: Proximal Policy Optimization—a standard RL algorithm used here to update the policy using the computed advantages.

Advantage: A value measuring how much better a specific action is compared to the average action in that state.