GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

📝 Paper Summary

Reinforcement Learning for LLMs · Mathematical Reasoning · Reward Engineering

GTPO and GRPO-S improve LLM reasoning by using policy entropy to dynamically redistribute rewards, assigning higher credit to uncertain steps in correct solutions and penalizing confident errors.

Core Problem

Mainstream RL algorithms like GRPO use coarse-grained credit assignment, giving identical rewards to all tokens in a sequence based solely on the final outcome.

Why it matters:

Long reasoning chains with a single final error receive zero reward, penalizing the many correct intermediate logical steps.
Conversely, sequences reaching the correct answer through flawed or guessed steps receive full reward, reinforcing bad reasoning.
Existing methods treat entropy only as a regularizer or filter, failing to actively reshape the reward signal for better supervision.

Concrete Example: A math reasoning sequence with dozens of correct logical steps might end with a calculation error, resulting in a binary reward of 0. GRPO treats this entire sequence as equally 'bad' as a completely nonsensical answer, wasting valuable training signal from the correct intermediate steps.
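To make the coarse-grained assignment concrete, here is a minimal sketch of a GRPO-style group-relative advantage that is broadcast unchanged to every token. The function names, the use of the population standard deviation, and the sample group are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - group mean) / (group std + eps).

    Assumption: population std; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, lengths):
    """Coarse credit: every token inherits its sequence's single advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]

# Hypothetical group of 4 responses to one prompt; only the first is correct.
rewards = [1.0, 0.0, 0.0, 0.0]
# The 40-token response stands in for a long, nearly correct derivation;
# the 3-token response stands in for a nonsensical answer.
lengths = [5, 40, 3, 7]
token_adv = broadcast_to_tokens(grpo_advantages(rewards), lengths)
```

In this sketch every token of the 40-token response carries the same negative advantage as every token of the 3-token response, which is the wasted signal the concrete example describes.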

Key Novelty

Dynamic Entropy Weighting

Repurposes policy entropy as a proxy for 'cognitive effort' or 'pivotal decision points' rather than just noise.
In correct solutions, high entropy signals valuable exploration (difficult steps navigated correctly) and receives a reward bonus.
In incorrect solutions, low entropy signals 'confident errors' and receives a heavier penalty to discourage stubborn incorrect reasoning.
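The two rules above can be sketched at the token level as follows. This is an illustration of the idea only: the normalization by the mean, the inverse-entropy penalty, and the default values of alpha_1 and alpha_2 are assumptions and do not reproduce the paper's Eq 3/4.

```python
def shape_token_rewards(entropies, correct, alpha_1=1.0, alpha_2=0.5, eps=1e-8):
    """Return one shaped reward per token for a single sequence.

    Correct sequence: base reward plus a bonus proportional to the token's
    entropy relative to the sequence mean (uncertain steps earn more).
    Incorrect sequence: a penalty proportional to the token's inverse entropy
    relative to the sequence mean (confident steps are penalized more).
    Both forms are hypothetical stand-ins for the paper's formulas.
    """
    if correct:
        mean_h = sum(entropies) / len(entropies) + eps
        return [alpha_1 + alpha_2 * h / mean_h for h in entropies]
    inverse = [1.0 / (h + eps) for h in entropies]
    mean_inv = sum(inverse) / len(inverse)
    return [-alpha_2 * v / mean_inv for v in inverse]

# Hypothetical per-token entropies: the middle token is the uncertain one.
entropies = [0.1, 2.0, 0.1]
good = shape_token_rewards(entropies, correct=True)
bad = shape_token_rewards(entropies, correct=False)
```

With these inputs the high-entropy token receives the largest reward in the correct case, and the low-entropy tokens receive the heaviest penalty in the incorrect case.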

Architecture

Illustration of the Dynamic Entropy Weighting mechanism. It contrasts 'Coarse-grained Reward' (GRPO) with the proposed method.

Evaluation Highlights

GTPO improves MATH 500 accuracy by 6.8 percentage points over GRPO (from 59.0 to 65.8) using Qwen2.5-Math-7B.
GRPO-S outperforms GRPO by 3.5 percentage points on the AIME 2024 benchmark (from 13.3 to 16.8) using Llama-3.1-8B-Instruct.
GTPO outperforms the strong baseline DAPO by 2.2 percentage points on MATH 500 (from 63.6 to 65.8) using Qwen2.5-Math-7B.

Breakthrough Assessment

8/10

Offers a theoretically grounded and empirically effective solution to the long-standing credit assignment problem in outcome-reward RL for LLMs without requiring an external value model (critic).

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from outcome-based rewards (e.g., correct/incorrect answer) without an explicit value function.

Inputs: A prompt q (e.g., a math problem).

Outputs: A generated reasoning sequence o ending in an answer.

Pipeline Flow

Prompt Sampling
Group Generation
Reward & Entropy Calculation
Advantage Estimation
Policy Update
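The data flow between these five stages can be sketched as one training step. All callables passed in are hypothetical placeholders for the real sampler, verifier, shaper, advantage estimator, and optimizer; the summary does not describe their interfaces.

```python
def training_step(prompt, sample_group, outcome_reward, shape_rewards,
                  estimate_advantages, update_policy, group_size=8):
    """One step over a single prompt (stage 1, prompt sampling, is the caller's job)."""
    # 2. Group generation: each response is (tokens, per_token_entropies).
    group = sample_group(prompt, group_size)
    # 3. Reward and entropy calculation: outcome reward, then entropy shaping.
    outcomes = [outcome_reward(prompt, tokens) for tokens, _ in group]
    shaped = [shape_rewards(r, ents) for r, (_, ents) in zip(outcomes, group)]
    # 4. Advantage estimation from the shaped rewards of the whole group.
    advantages = estimate_advantages(shaped)
    # 5. Policy update using the sampled group and its advantages.
    return update_policy(group, advantages)
```

Passing the stages in as arguments keeps the sketch neutral between GTPO (per-token shaping) and GRPO-S (per-sequence shaping), which differ only in what `shape_rewards` and `estimate_advantages` return.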

System Modules

Policy Model

Generates a group of G responses for a given prompt q.

Model or implementation: Llama-3.1-8B-Instruct or Qwen2.5-Math-7B

Entropy Calculator (Evaluation)

Calculates per-token entropy for GTPO or average sequence entropy for GRPO-S.

Model or implementation: Mathematical formula (Eq 2 or 6)
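As a rough illustration of what this module computes, the sketch below derives per-token Shannon entropy from raw logits and averages it over a sequence. It assumes the standard definition H = -Σ p log p over the next-token distribution; whether this matches Eq 2 and Eq 6 exactly is an assumption.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one position."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sequence_entropy(logits_per_token):
    """Return (per-token entropies, their mean) for one generated sequence."""
    ents = [token_entropy(step) for step in logits_per_token]
    return ents, sum(ents) / len(ents)

# Toy 4-word vocabulary: a fully uncertain step followed by a confident one.
ents, avg = sequence_entropy([[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]])
```

The per-token list is what a token-level method would consume, and the mean is what a sequence-level method would consume.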

Reward Shaper (Evaluation)

Combines binary outcome rewards with entropy values to create shaped rewards.

Model or implementation: Weighting formula (Eq 3/4 for GTPO, Eq 7/8 for GRPO-S)
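For the sequence-level case, one possible shaping rule is sketched below. It is not the paper's Eq 7/8: the within-subgroup normalization, the inverse-entropy penalty, and the default beta_1 and beta_2 values are all assumptions chosen to illustrate the bonus and penalty directions.

```python
def shape_sequence_rewards(outcomes, mean_entropies, beta_1=0.5, beta_2=0.5, eps=1e-8):
    """Return one shaped reward per sequence in a group.

    outcomes: 1 for a correct final answer, 0 otherwise.
    mean_entropies: average token entropy of each sequence.
    Hypothetical rule: correct sequences earn a bonus that grows with their
    entropy relative to other correct sequences; incorrect sequences earn a
    penalty that grows with their confidence (inverse entropy) relative to
    other incorrect sequences.
    """
    good = [h for o, h in zip(outcomes, mean_entropies) if o == 1]
    bad_inv = [1.0 / (h + eps) for o, h in zip(outcomes, mean_entropies) if o == 0]
    good_norm = sum(good) / len(good) + eps if good else 1.0
    bad_norm = sum(bad_inv) / len(bad_inv) if bad_inv else 1.0
    shaped = []
    for o, h in zip(outcomes, mean_entropies):
        if o == 1:
            shaped.append(1.0 + beta_1 * h / good_norm)
        else:
            shaped.append(-beta_2 * (1.0 / (h + eps)) / bad_norm)
    return shaped

# Hypothetical group: two correct and two incorrect responses.
shaped = shape_sequence_rewards([1, 1, 0, 0], [0.2, 0.8, 0.1, 1.0])
```

The shaped values would then replace the binary rewards as input to the group-relative advantage computation.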

Novel Architectural Elements

Dynamic Entropy Weighting mechanism inserted into the reward calculation pipeline, altering the advantage function based on token-level or sequence-level uncertainty.

Modeling

Base Model: Llama-3.1-8B-Instruct and Qwen2.5-Math-7B

Training Method: Group Token Policy Optimization (GTPO) and Sequence-Level GRPO (GRPO-S)

Objective Functions:

Purpose: Optimize policy using entropy-shaped token-level rewards.

Formally: J_GTPO(θ) combines PPO-style clipped loss with advantages derived from entropy-weighted rewards, summing over successful (bonus) and unsuccessful (penalty) sets.

Purpose: Optimize policy using entropy-shaped sequence-level rewards.

Formally: J_GRPO-S(θ) uses sequence-average entropy to modulate the global reward before computing group relative advantages.
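Both objectives are described as PPO-style clipped losses, so the shared core can be sketched as a clipped surrogate over per-token advantages. This shows only the generic PPO clipping step; the specific summation over bonus and penalty sets in Eq 5/9 is not reproduced, and the default clip_epsilon of 0.2 is an assumption.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_epsilon=0.2):
    """Mean over tokens of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current and the sampling (old) policy. advantages: one value per token,
    e.g. token-level shaped advantages, or a sequence-level advantage
    repeated for every token of the sequence.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
        total += min(ratio * a, clipped * a)           # pessimistic bound
    return total / len(advantages)
```

The objective is maximized; a training loop would typically minimize its negative.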

Adaptation: Full fine-tuning (implied, as RL typically updates full weights or large subsets)

Trainable Parameters: Full model parameters

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
alpha_1: Hyperparameter balancing binary reward and entropy bonus (GTPO)
alpha_2: Hyperparameter balancing binary reward and entropy bonus (GTPO)
beta_1: Hyperparameter for sequence-level entropy bonus (GRPO-S)
beta_2: Hyperparameter for sequence-level entropy penalty (GRPO-S)
clip_epsilon: Standard PPO clipping parameter (implied in Eq 5/9)

Compute: Not reported in the paper

Comparison to Prior Work

vs. GRPO: GTPO/GRPO-S introduce entropy-based reward modulation; GRPO uses uniform rewards.
vs. DAPO: GTPO/GRPO-S use entropy to differentiate credit within correct/incorrect groups; DAPO treats all correct/incorrect outcomes uniformly.
vs. PPO: GTPO/GRPO-S are critic-free (value-function-free), relying on group baselines.

Limitations

Computational overhead for calculating per-token entropy in GTPO compared to sequence-level methods.
Complexity in tuning weighting hyperparameters (alpha/beta) for the entropy components.
Relies on the assumption that high entropy in correct answers always correlates with 'valuable exploration', which might not hold for lucky guesses.
No explicit statistical significance tests provided for the main results.

Reproducibility

No code URL provided. Hyperparameters like learning rate and batch size are missing from the text. Mathematical derivations are provided in appendices.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using Chain-of-Thought prompting.

Benchmarks:

GSM8K (Grade school math word problems)
MATH 500 (Challenging competition mathematics problems)
AIME 2024 (High-difficulty math competition problems)

Metrics:

Accuracy (Pass@1)
Pass@K (Pass@2, Pass@8, Pass@32)
Statistical methodology: Not explicitly reported in the paper
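Since the statistical methodology is not reported, it is unknown how Pass@K was computed here. A common choice is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct; the sketch below shows that estimator as an assumption about typical practice, not as the paper's procedure.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 4 samples generated, 1 of them correct.
p1 = pass_at_k(4, 1, 1)
p2 = pass_at_k(4, 1, 2)
```

Per-problem values are averaged over the benchmark to obtain the reported score.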

Key Results

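Benchmark	Metric	Comparison	Baseline	This Paper	Δ
MATH 500	Accuracy	GTPO vs. GRPO (Qwen2.5-Math-7B)	59.0	65.8	+6.8
MATH 500	Accuracy	GTPO vs. DAPO (Qwen2.5-Math-7B)	63.6	65.8	+2.2
AIME 2024	Accuracy	GRPO-S vs. GRPO (Llama-3.1-8B-Instruct)	13.3	16.8	+3.5
GSM8K	Accuracy	not specified	79.5	83.1	+3.6
MATH 500	Pass@32	not specified	78.4	82.6	+4.2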

Experiment Figures

A conceptual comparison of Credit Assignment granularity between GRPO (Sequence-level, coarse), GRPO-S (Sequence-level with entropy), and GTPO (Token-level with entropy).

Main Takeaways

Both GTPO and GRPO-S consistently outperform GRPO and DAPO across multiple math benchmarks.
GTPO (token-level) tends to offer higher precision, while GRPO-S (sequence-level) offers superior stability in tasks requiring long Chain-of-Thought reasoning.
The entropy-based reward shaping successfully differentiates between valuable exploration (reinforced) and confident errors (penalized).
Improvements are visible in both Pass@1 accuracy and Pass@K metrics, indicating better exploration and robustness.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Proximal Policy Optimization (PPO)
Group Relative Policy Optimization (GRPO)
Policy Entropy

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sequence's reward to the average reward of a group of samples for the same prompt, avoiding a critic model.

GTPO: Group Token Policy Optimization—the proposed token-level algorithm that assigns unique, entropy-weighted rewards to every token.

GRPO-S: Sequence-Level Group Relative Policy Optimization—the proposed sequence-level variant that modulates the global reward for a sequence based on its average entropy.

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization. A baseline RL method that builds on GRPO with decoupled clipping ranges, dynamic sampling, a token-level policy-gradient loss, and overlong reward shaping.

CoT: Chain-of-Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer.

Policy Entropy: A measure of the randomness or uncertainty in the model's next-token prediction distribution.

Importance Sampling: A technique used in RL (specifically PPO/GRPO) to estimate properties of a target distribution while sampling from a different (older) distribution, using a ratio of probabilities.
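A small worked example of the importance-sampling idea: samples are drawn under an old policy, and reweighting by the probability ratio recovers an expectation under the new policy. The two-action policies and the sample list are made up for illustration.

```python
def importance_weighted_mean(samples, p_new, p_old, f):
    """Estimate E_{x ~ p_new}[f(x)] from samples drawn under p_old."""
    return sum(p_new[x] / p_old[x] * f(x) for x in samples) / len(samples)

# Hypothetical two-action policies.
p_old = {"a": 0.5, "b": 0.5}
p_new = {"a": 0.8, "b": 0.2}
# Samples whose frequencies match p_old exactly, so the estimate is exact.
samples = ["a", "b", "a", "b"]
estimate = importance_weighted_mean(
    samples, p_new, p_old, f=lambda x: 1.0 if x == "a" else 0.0
)
```

Here the estimate equals 0.8, the probability of action "a" under the new policy, even though no sample was drawn from it. In PPO and GRPO the same ratio is computed per token and then clipped to limit the size of each update.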