GRPO-$λ$: Credit Assignment improves LLM Reasoning

📝 Paper Summary

LLM Reasoning, Reinforcement Learning for LLMs, Post-training optimization

GRPO-λ improves LLM reasoning by incorporating eligibility traces into the critic-free GRPO algorithm, enabling better credit assignment to earlier tokens in generated sequences without a value network.

Core Problem

The state-of-the-art GRPO algorithm has no critic model, which prevents fine-grained credit assignment across tokens. Its baseline is a group average that estimates the value of the start state, so it becomes an increasingly biased value estimate for later tokens in a sequence.

Why it matters:

Effective reasoning requires identifying exactly which steps in a long chain of thought led to the correct solution
Standard GRPO assigns the same sparse reward to all tokens, failing to distinguish crucial reasoning steps from irrelevant ones
Training a separate critic model (like in PPO) is memory-intensive and difficult due to the disparity between the pre-trained policy and the initialized critic

Concrete Example: In a multi-step math problem, if an LLM generates a correct answer, GRPO rewards every token equally. However, if the model made a lucky guess after a flawed intermediate step, GRPO reinforces the flaw. GRPO-λ uses traces to propagate the final reward back to earlier, critical decision points more effectively.
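The uniform credit described above can be made concrete with a minimal sketch of the standard GRPO advantage. It assumes the common formulation in which each response's reward is standardized against its group (mean and standard deviation) and the resulting scalar is copied to every token; the function name and the toy rewards are illustrative only.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Group-relative advantage per response, copied unchanged to each token."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_response = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token of a response receives the same scalar: no per-token credit.
    return [[a] * n for a, n in zip(per_response, lengths)]

# Toy group of G = 4 responses with binary rewards and different lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 2, 4, 3]
advantages = grpo_token_advantages(rewards, lengths)
```

In this toy group, every token of a correct response gets the same positive advantage, including any tokens that belong to a flawed intermediate step.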

Key Novelty

Critic-free Eligibility Traces for LLMs (GRPO-λ)

Reformulates Generalized Advantage Estimation (GAE) to work without a critic model by using token-level log-probabilities and group-relative rewards
Introduces a 'both' weighting strategy that balances credit assignment between early tokens (which have lower value estimation error) and late tokens (which are closer to the final reward)
Proves a bound on the error of using start-state value estimates for later tokens, justifying the need for decaying weights on intermediate steps

Architecture

Pseudocode for GRPO-λ showing how eligibility traces are integrated into the GRPO update loop.

Evaluation Highlights

+33 points average improvement over GRPO across 5 benchmarks (AIME24, Math500, OlympiadMath, MinervaMath, AMC)
30-40% improved performance during RL training on LLaMA-3.1 and Qwen-2.5 architectures compared to standard GRPO
+4.5 points improvement on the Deepseek-R1-Distill-Qwen-7B model compared to GRPO baseline

Breakthrough Assessment

8/10

Offers a mathematically grounded, memory-efficient improvement to the current SFT/RL pipeline for reasoning. Significant empirical gains (+33 points) without the overhead of a critic model make it highly practical.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) where the LLM is the policy π, tokens are actions, and the context/history is the state.

Inputs: A mathematical question/prompt tokenized to m tokens s_0 = (a^0, ..., a^{m-1})

Outputs: A sequence of generated tokens leading to a final answer
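As a small illustration of this setting, the sketch below treats the state as the tuple of token ids seen so far and the action as the next token, so the transition is plain concatenation. The token ids and the `step` helper are hypothetical.

```python
from typing import Tuple

State = Tuple[int, ...]

def step(state: State, action: int) -> State:
    """Deterministic transition: the next state is the context plus the new token."""
    return state + (action,)

# s_0 is the tokenized prompt (m = 3 hypothetical token ids).
prompt: State = (101, 7, 42)
state = prompt
for token in (5, 9, 2):  # three generated tokens (actions)
    state = step(state, token)
```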

Pipeline Flow

Policy Rollout: Generate group of G outputs from prompt s_0
Reward Calculation: Evaluate correctness of final answers (1 for correct, 0 for incorrect)
Advantage Estimation: Compute advantages using GRPO-λ reformulation
Policy Update: Optimize policy using PPO-style clipped objective with new advantages
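The first three steps above can be sketched as follows. This is a rough illustration, not the paper's algorithm: the rollout is stubbed with fixed data, the verifier is a plain string match, and the λ-weighting shown (a weight of (γλ)^(T-1-t) on token t of a length-T response) is an assumed "decaying trace" form chosen for simplicity. All function names are hypothetical.

```python
from statistics import mean, pstdev

def verify(answer: str, truth: str) -> float:
    """Binary reward: 1 for a correct final answer, 0 otherwise."""
    return 1.0 if answer.strip() == truth.strip() else 0.0

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group (assumed GRPO-style baseline)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def recent_trace(length, lam, gamma=1.0):
    """Assumed decaying weights: largest at the last token, smaller further back."""
    return [(gamma * lam) ** (length - 1 - t) for t in range(length)]

# Step 1 (stubbed): G = 4 sampled responses as (token_count, final_answer).
group = [(5, "12"), (3, "15"), (4, "12"), (6, "7")]
truth = "12"

# Step 2: reward each response by checking its final answer.
rewards = [verify(answer, truth) for _, answer in group]

# Step 3: per-token advantages = group-relative advantage times trace weight.
adv = group_advantages(rewards)
token_adv = [[a * w for w in recent_trace(n, lam=0.9)]
             for a, (n, _) in zip(adv, group)]
```

The per-token values in `token_adv` would then feed the clipped policy update in the final step.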

System Modules

Generator (Policy)

Generates reasoning steps and final answers; updated via RL

Model or implementation: LLaMA-3.1-8B-Instruct or Qwen-2.5-Math-1.5B/7B or Deepseek-R1-Distill-Qwen-1.5B/7B

Reward Oracle

Verifies the correctness of the generated answer against ground truth

Model or implementation: Deterministic verifier

Novel Architectural Elements

Critic-free implementation of eligibility traces applied directly to the policy loss weighting
Token-specific weighting mechanism ('both' trace) that assigns high importance to both early tokens (low value error) and late tokens (proximity to reward)
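To make the difference between the two weighting styles tangible, here is a hypothetical sketch of their shapes. The exact functional forms used in the paper are not given in this summary; `recent_weights` assumes a plain geometric decay away from the final token, and `both_weights` assumes a U-shape built as the maximum of a decay from the start and a decay from the end.

```python
def recent_weights(T, lam, gamma=1.0):
    """Assumed 'recent' trace: weight decays with distance from the final token."""
    d = gamma * lam
    return [d ** (T - 1 - t) for t in range(T)]

def both_weights(T, lam, gamma=1.0):
    """Hypothetical 'both' trace: high at both ends, lowest in the middle."""
    d = gamma * lam
    return [max(d ** t, d ** (T - 1 - t)) for t in range(T)]

recent = recent_weights(5, lam=0.5)  # [0.0625, 0.125, 0.25, 0.5, 1.0]
both = both_weights(5, lam=0.5)      # [1.0, 0.5, 0.25, 0.5, 1.0]
```

With λ = 1 both functions return all ones, which would recover uniform per-token weighting.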

Modeling

Base Model: LLaMA-3.1-8B-Instruct, Qwen-2.5-Math-1.5B/7B, Deepseek-R1-Distill-Qwen-1.5B/7B

Training Method: GRPO-λ (Reinforcement Learning)

Objective Functions:

Purpose: Optimize policy to maximize expected return using reparameterized GAE.

Formally: ℓ_π = min(π_ratio_GAE(s_t) * δ_t, clip(...) * δ_t), where π_ratio_GAE incorporates λ-weighting.

Purpose: Prevent policy from deviating too far from pre-trained reference.

Formally: ℓ_KL = D_KL(π_θ || π_ref).
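For reference, a single-token version of a PPO-style clipped term and a KL penalty could look like the sketch below. It uses the generic PPO surrogate and an exact KL between two small categorical distributions; how GRPO-λ folds the λ-weighting into the ratio is not reproduced here, and the clipping value 0.2 is only a common default, not a reported setting.

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Generic PPO clipped surrogate for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def kl_categorical(p, q):
    """Exact D_KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A ratio of 1.5 with positive advantage is capped at 1 + eps.
gain = clipped_term(math.log(1.5), 0.0, advantage=1.0)
# A ratio of 0.5 with negative advantage is evaluated at 1 - eps.
loss_side = clipped_term(math.log(0.5), 0.0, advantage=-1.0)
penalty = kl_categorical([0.5, 0.5], [0.9, 0.1])
```

In practice the KL term is usually estimated from sampled token log-probabilities rather than computed over the full vocabulary.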

Adaptation: Full fine-tuning

Training Data:

Trained on 44 different math reasoning datasets. The individual datasets are not listed; MathShepherd and GSM8K are mentioned in the context of evaluation.

Key Hyperparameters:

group_size: Not explicitly reported in the paper
lambda: Evaluated λ ∈ [0, 1] (ablation shows optimal around 0.95-0.98 depending on trace type)
gamma: Typically 1.0 for episodic math tasks
clip_epsilon: Standard PPO clipping (usually 0.1 or 0.2, exact value not reported)

Compute: Negligible walltime difference between GRPO and GRPO-λ

Comparison to Prior Work

vs. GRPO: Adds eligibility traces (λ) to improve credit assignment without adding a critic
vs. PPO: Removes the memory-intensive critic network while retaining the benefits of GAE [not cited in paper as direct baseline, but conceptual comparison]
vs. VinePPO: Significantly more computationally efficient as it does not require branching rollouts at every token [not cited in paper as direct baseline]

Limitations

Computational overhead increases linearly with sequence length (though claimed to be negligible)
Depends on group size for variance reduction (inherited from GRPO)
Binary sparse rewards might still be challenging for very long reasoning chains without intermediate supervision

Reproducibility

No code release is indicated. The paper includes mathematical proofs in Appendix A and pseudocode for the algorithms. Hyperparameters such as group size and learning rate are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with verifiable ground truth answers

Benchmarks:

GSM8K (Grade school math word problems)
MathShepherd (Math reasoning)
AIME24 (Competition Math)
Math500 (Competition Math)
OlympiadMath (Competition Math)
MinervaMath (Math reasoning)
AMC (American Mathematics Competitions)

Metrics:

Accuracy (Pass@1)
Average Return
Statistical methodology: Not explicitly reported in the paper
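The two metrics above are simple averages. The sketch below assumes one sampled answer per problem and an exact string match as the correctness check; real math verifiers typically normalize expressions before comparing, and the helper names are illustrative.

```python
def pass_at_1(predictions, truths):
    """Fraction of problems whose single sampled answer matches the ground truth."""
    correct = sum(1 for p, t in zip(predictions, truths) if p.strip() == t.strip())
    return correct / len(truths)

def average_return(returns):
    """Mean episodic return over a batch of rollouts."""
    return sum(returns) / len(returns)

acc = pass_at_1(["12", "7", " 3 "], ["12", "8", "3"])
avg = average_return([1.0, 0.0, 1.0, 1.0])
```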

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average of 5 benchmarks (AIME24, Math500, OlympiadMath, MinervaMath, AMC)	Average Score	Not reported in the paper	Not reported in the paper	+33.00
Deepseek-R1-Distill-Qwen-7B Evaluation	Score improvement	Not reported in the paper	Not reported in the paper	+4.50
Math reasoning datasets (Training plots)	Performance Improvement during training	Not reported in the paper	Not reported in the paper	+30-40%

Experiment Figures

Bar chart comparing GRPO and GRPO-λ performance on Deepseek-R1-Distill-Qwen-7B across 5 benchmarks.

Comparison of different trace weighting schemes (recent, both) and λ values.

Main Takeaways

GRPO-λ consistently outperforms GRPO across multiple model sizes (1.5B to 8B) and architectures (LLaMA, Qwen).
The proposed 'both' trace weighting strategy, which emphasizes both early and late tokens, yields the best results compared to traditional decaying traces.
The method achieves faster convergence during training compared to standard GRPO.
Improvements are particularly strong on difficult competition math benchmarks (AIME, OlympiadMath).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning fundamentals (Policy Gradients, Value Functions)
Proximal Policy Optimization (PPO)
Group Relative Policy Optimization (GRPO)

Key Terms

GRPO: Group Relative Policy Optimization—a critic-free RL algorithm that estimates baselines by averaging returns from a group of outputs generated from the same prompt

GAE: Generalized Advantage Estimation, a technique that forms the advantage as an exponentially weighted sum of TD errors, with λ trading bias against variance in the policy gradient estimate
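For comparison with the critic-free reformulation, the standard critic-based GAE recursion can be sketched as below. The `values` list stands in for a critic's predictions and is purely illustrative.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard GAE. `values` has len(rewards) + 1 entries (bootstrap value last)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse terminal reward and an all-zero critic, with lam = 0.5.
adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=0.5)
```

With a terminal-only reward and a zero critic, the advantage simply decays by γλ per step back from the reward, which is the behavior eligibility traces reproduce.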

Eligibility Traces: A mechanism in RL that tracks which states/actions recently occurred to assign credit for rewards received later
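A minimal sketch of classic accumulating traces in the tabular setting follows: every trace decays by γλ at each step, and the trace of the state just visited is incremented by one. This is the textbook mechanism, not the token-level variant used in GRPO-λ.

```python
def accumulate_traces(visits, n_states, gamma=1.0, lam=0.9):
    """Return the eligibility-trace vector after each visited state."""
    trace = [0.0] * n_states
    history = []
    for s in visits:
        trace = [gamma * lam * e for e in trace]  # decay all traces
        trace[s] += 1.0                           # bump the visited state
        history.append(list(trace))
    return history

history = accumulate_traces([0, 1, 0], n_states=2, lam=0.5)
```

A reward arriving after the last step would be credited to each state in proportion to its current trace value.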

PPO: Proximal Policy Optimization—a standard RL algorithm that uses a clipped objective to prevent training instabilities

Credit Assignment: The problem of determining which past actions are responsible for a delayed reward

Critic: A neural network that estimates the value (expected return) of a state, used to compute advantages in algorithms like PPO

TD error: Temporal-Difference error, the reward plus the discounted estimated value of the next state, minus the estimated value of the current state: δ_t = r_t + γ V(s_{t+1}) − V(s_t)