TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF), Process Reward Models (PRM), LLM Mathematical Reasoning

TDRM improves large language model reasoning by training process reward models using temporal difference learning, creating smoother reward signals that enhance both online reinforcement learning and inference-time search.

Core Problem

Existing reward models lack temporal consistency, assigning disconnected scores to adjacent reasoning steps, which leads to unstable training signals and inefficient search during inference.

Why it matters:

Inconsistent rewards make it difficult for models to distinguish which specific step contributed to success or failure, especially in long chain-of-thought reasoning.
Standard outcome-based rewards provide sparse feedback (only at the end), while current process rewards often fail to update based on future context, leading to misleading guidance.
Ineffective reward modeling hampers the data efficiency of reinforcement learning, requiring massive datasets to achieve performance that could be reached with far fewer samples.

Concrete Example: In a long math proof, a standard reward model might assign a high score to a step that looks correct locally but actually leads to a dead end. TDRM, by propagating future value estimates backward, would lower the score of that earlier step once the dead end is realized, correcting the signal.

Key Novelty

Temporal Difference Reward Modeling (TDRM)

Applies n-step Temporal Difference (TD) learning to train process reward models online, where the reward for a current step is updated based on the estimated value of future steps.
Integrates a cosine-based reward shaping mechanism to stabilize training across varying chain-of-thought lengths, preventing rewards from collapsing for long reasoning traces.
Combines these smooth process rewards with rule-based verification (outcome rewards) in a linear combination to guide Group Relative Policy Optimization (GRPO) training.
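
The n-step bootstrapping idea in the first point above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the zero intermediate rewards, the discount gamma, and the clamp range [0, 1] are all assumptions.

```python
from typing import List


def n_step_td_targets(values: List[float], final_reward: float,
                      n: int = 3, gamma: float = 1.0) -> List[float]:
    """Build clamped n-step TD targets for every reasoning step.

    values[t] is the PRM's current estimate for step t. The outcome reward is
    treated as the value of a virtual terminal state at index len(values).
    Intermediate rewards are assumed to be zero (an assumption of this sketch).
    """
    num_steps = len(values)
    targets = []
    for t in range(num_steps):
        lookahead = min(n, num_steps - t)  # never look past the terminal state
        k = t + lookahead
        bootstrap = final_reward if k == num_steps else values[k]
        target = (gamma ** lookahead) * bootstrap
        targets.append(min(1.0, max(0.0, target)))  # clamp to [0, 1] (assumed range)
    return targets
```

With n = 3, an early step is pulled toward the estimate three steps ahead, while the last three steps are pulled directly toward the verified outcome.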

Architecture

Overview of the TDRM framework, illustrating the interaction between the Policy Model, the TDRM (Process Reward Model), and the RL update loop.

Evaluation Highlights

Achieves comparable RL performance to baselines using only 2.5k data samples, whereas baselines require 50.1k data samples (approx. 20x data efficiency gain).
Improves inference-time tree search accuracy by up to +23.7% compared to standard process reward models.
Boosts Best-of-N verification performance by up to +6.6% across various model sizes and families (e.g., Qwen2.5, GLM-4).

Breakthrough Assessment

8/10

Significant gains in data efficiency (20x) and consistent improvements across diverse models and tasks suggest a robust methodology that addresses a fundamental weakness in current reward modeling.

⚙️ Technical Details

Problem Definition

Setting: LLM reasoning modeled as a Markov Decision Process (MDP) where states are token sequences and actions are newly generated sentences.

Inputs: Input prompt q (e.g., math problem)

Outputs: Reasoning chain o = (o_1, ..., o_T) and final answer

Pipeline Flow

Prompt Input
Policy Model Generation (produces reasoning steps)
TDRM Evaluation (assigns process rewards to steps)
Outcome Verification (assigns final reward)
Reward Aggregation (combines process + outcome)
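
The final aggregation stage can be sketched as below. The summary only states that process and outcome rewards are combined linearly, so the convex form, the default alpha, and the use of the mean step score as the trajectory-level process reward are assumptions of this sketch.

```python
from typing import List


def combined_reward(step_scores: List[float], outcome_reward: float,
                    alpha: float = 0.5) -> float:
    """Linearly combine a process reward with a rule-based outcome reward.

    step_scores are per-step PRM scores; their mean is used as the
    trajectory-level process reward (an assumption, not from the paper).
    """
    process_reward = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * process_reward + (1.0 - alpha) * outcome_reward
```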

System Modules

Policy Model

Generates step-by-step reasoning trajectories given a prompt.

Model or implementation: Various (e.g., Qwen2.5-Math-7B, DeepSeek-R1-Distill-Qwen-7B)

TDRM (PRM) (Evaluation)

Estimates the value (process reward) of each intermediate reasoning step using a TD-trained value function.

Model or implementation: Same architecture as Policy Model (initialized from it), with a scalar head.

Verifier (Evaluation)

Checks if the final answer matches the ground truth and is properly formatted.

Model or implementation: Rule-based function (is_equivalent, has_boxed)
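
A rule-based verifier of this kind might look like the sketch below. Only the names is_equivalent and has_boxed come from the summary; their bodies, the helper extract_boxed, and the normalization rules are hypothetical and far simpler than a production math-answer checker.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def has_boxed(text: str) -> bool:
    return extract_boxed(text) is not None


def is_equivalent(prediction: str, truth: str) -> bool:
    """Compare answers after light normalization, then numerically if possible."""
    def norm(s: str) -> str:
        return s.replace("$", "").replace(" ", "").strip().rstrip(".")
    a, b = norm(prediction), norm(truth)
    if a == b:
        return True
    try:
        return abs(float(a) - float(b)) < 1e-9
    except ValueError:
        return False


def outcome_reward(response: str, truth: str) -> float:
    """1.0 only if the response is formatted with \\boxed{} and matches truth."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and is_equivalent(answer, truth) else 0.0
```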

Novel Architectural Elements

Integration of online TD-estimated value targets directly into the reward modeling loop for LLM reasoning.
Hybrid reward aggregation combining rule-based outcome verification with TD-derived process rewards for GRPO.

Modeling

Base Model: Qwen2.5-(0.5B, 1.5B, 3B, 7B, 14B, 32B), Qwen2.5-Math-(1.5B, 7B), DeepSeek-R1-Distill-Qwen-(1.5B, 7B), GLM4-9B-0414, GLM-Z1-9B-0414

Training Method: Group Relative Policy Optimization (GRPO) guided by TDRM

Objective Functions:

Purpose: Train the Process Reward Model (PRM) to predict future values consistently.

Formally: Cross-entropy loss between the model output probability p_t and the clamped TD target v_tilde_t.

Purpose: Optimize the Policy Model to maximize expected rewards.

Formally: GRPO objective that maximizes the advantage A_i,j, estimated from group-normalized combined rewards (a linear combination of PRM and verifiable rewards), minus a KL divergence penalty.
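
The PRM objective treats the clamped TD target as a soft label. A minimal per-step sketch, assuming a binary cross-entropy form and a small eps for numerical safety (both assumptions; the function name is hypothetical):

```python
import math


def soft_label_cross_entropy(p: float, target: float, eps: float = 1e-7) -> float:
    """Cross-entropy between predicted probability p and a soft target in [0, 1]."""
    p = min(1.0 - eps, max(eps, p))  # keep log() finite
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

For a fixed target, the loss is smallest when p equals the target, so the PRM output is pulled toward the bootstrapped value rather than toward a hard 0/1 label.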

Adaptation: Full fine-tuning (implied, no LoRA mentioned for main results)

Trainable Parameters: All parameters of the Policy Model and the Reward Model

Training Data:

MATH (12k problems) for training.
GSM8K and OOD benchmarks (OlympiadBench, GAIDC) for evaluation.

Key Hyperparameters:

n_step_td: 3 (optimal based on ablation)
td_lambda: Not specified for main results, compared in ablation
reward_combination_alpha: Hyperparameter balancing process and outcome rewards (specific value not explicitly listed in text, likely tuned)
kl_coefficient_beta: Standard GRPO setting (exact value not in text)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ScalarPRM: TDRM uses bootstrapping (TD) for smoother, temporally consistent updates rather than assigning the final outcome to every step.
vs. DeepSeek-R1-Zero (GRPO only): TDRM adds dense process signals to the sparse outcome rewards, improving data efficiency.
vs. Math-Shepherd: TDRM learns a value function via TD rather than relying purely on verifying generated steps against outcomes.
vs. Standard PPO [not cited in paper]: TDRM uses GRPO which avoids a separate value network for the policy, though it trains a separate PRM for rewards.

Limitations

Computational cost of training a separate reward model alongside the policy.
Dependence on verifiable domains (math) where ground truth is available for the outcome reward component.
Sensitivity to the balance parameter alpha between process and outcome rewards.
Complexity of tuning TD hyperparameters (n-step, lambda) for stability.

Reproducibility

Code: https://github.com/THUDM/TDRM

The paper uses standard datasets (MATH, GSM8K). Specific hyperparameters, such as learning rates and the alpha value for reward combination, are not exhaustively listed in the main text.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using Chain-of-Thought generation.

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Challenging competition math problems)
OlympiadBench (Olympiad-level math and physics problems; OOD)
GAIDC (Math problems; OOD)

Metrics:

Accuracy (Pass@1)
Best-of-N Accuracy
Tree Search Success Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Reinforcement Learning (RL) performance improvements showing TDRM's effectiveness compared to baselines using different policy models.
MATH	Accuracy	56.4	66.5	+10.1
MATH	Accuracy	63.9	68.9	+5.0
MATH	Accuracy	13.5	20.4	+6.9
Inference-time verification results comparing TDRM against ScalarPRM in Best-of-N settings.
MATH	Best-of-64 Accuracy	73.5	77.5	+4.0
GSM8K	Best-of-64 Accuracy	93.4	95.5	+2.1
Ablation study on TD steps (n) showing the impact of lookahead length on performance.
MATH	Best-of-64 Accuracy	74.2	77.5	+3.3

Experiment Figures

Comparison of reward smoothness between ScalarPRM and TDRM.

Tree Search performance (Success Rate vs. Number of Rollouts) on MATH and GSM8K.

Main Takeaways

TDRM significantly improves data efficiency, matching baseline performance with ~20x less data (2.5k vs 50.1k samples).
Smoother reward landscapes (lower Lipschitz constant) generated by TD learning lead to better tree-search and Best-of-N performance compared to standard Monte Carlo PRMs.
The method generalizes across multiple model families (Qwen, GLM, DeepSeek) and sizes (0.5B to 7B), consistently outperforming base models and standard RL baselines.
Reward shaping using cosine schedules combined with TD updates helps stabilize training for long Chain-of-Thought sequences.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (MDPs, value functions)
Temporal Difference (TD) Learning
Large Language Models (LLMs) and Chain-of-Thought (CoT)
Process Reward Models (PRM) vs. Outcome Reward Models (ORM)

Key Terms


TDRM: Temporal Difference Reward Modeling—the proposed method using TD learning to train smoother process reward models.

PRM: Process Reward Model—a model that assigns scores to intermediate steps of reasoning, not just the final answer.

ORM: Outcome Reward Model—a model that assigns scores based solely on the correctness of the final result.

TD learning: Temporal Difference learning—an RL method where value estimates are updated based on other value estimates (bootstrapping) rather than waiting for the final outcome.

n-step TD: A variant of TD learning that looks n steps into the future to update the current value estimate.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same prompt, removing the need for a separate value network.
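
The group normalization at the heart of GRPO can be sketched as below. The function name and the eps term are assumptions; the clipping and KL terms of the full objective are omitted.

```python
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value network is needed as a baseline.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```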

RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness checks (e.g., math answers) as reward signals.

CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer.

Best-of-N: An inference strategy where N solutions are generated, and the one with the highest reward model score is selected.
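
As a minimal sketch, the selection step reduces to an argmax over candidate scores. Here score_fn is a hypothetical stand-in for the reward model.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)
```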

Tree Search: An inference strategy (like beam search or lookahead search) that explores multiple reasoning paths and uses a reward model to prune or prioritize them.
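
One simple instance is step-level beam search, sketched below. The callbacks expand_fn, score_fn, and is_done are hypothetical placeholders for the policy model, the reward model, and a stopping check; the paper's actual search procedure may differ.

```python
from typing import Callable, List

Steps = List[str]


def step_beam_search(prompt: str,
                     expand_fn: Callable[[str, Steps], List[str]],
                     score_fn: Callable[[str, Steps], float],
                     is_done: Callable[[Steps], bool],
                     beam_width: int = 2, max_depth: int = 4) -> Steps:
    """Grow reasoning chains one step at a time, keeping the top-scoring beams."""
    beams: List[Steps] = [[]]
    for _ in range(max_depth):
        candidates: List[Steps] = []
        for steps in beams:
            if steps and is_done(steps):
                candidates.append(steps)  # finished chains carry over unchanged
                continue
            for next_step in expand_fn(prompt, steps):
                candidates.append(steps + [next_step])
        if not candidates:
            break
        candidates.sort(key=lambda s: score_fn(prompt, s), reverse=True)
        beams = candidates[:beam_width]  # prune with the reward model
        if all(is_done(s) for s in beams):
            break
    return beams[0]
```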

Lipschitz constant: A measure of smoothness; a smaller constant implies the function (reward model) changes less abruptly between inputs.
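
A rough empirical proxy is the largest jump between scores of adjacent steps, treating consecutive steps as one unit apart. This proxy and the function name are assumptions; the summary does not state how the paper measures smoothness.

```python
from typing import List


def max_adjacent_jump(step_scores: List[float]) -> float:
    """Largest absolute change in reward between consecutive steps."""
    return max(
        (abs(b - a) for a, b in zip(step_scores, step_scores[1:])),
        default=0.0,
    )
```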

Cosine Reward: A reward shaping function used in this paper that adjusts rewards based on correctness and step length, following a cosine curve.
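
One common form of length-dependent cosine shaping interpolates between two endpoint rewards along half a cosine period, as sketched below. Whether the paper uses exactly this form is an assumption, and the endpoint values in the test are hypothetical.

```python
import math


def cosine_shaped_reward(length: int, max_length: int,
                         value_at_zero: float, value_at_max: float) -> float:
    """Interpolate from value_at_zero (length 0) to value_at_max (max_length).

    In practice a separate pair of endpoints would be chosen for correct and
    incorrect answers; those values are not specified in this summary.
    """
    length = min(max(length, 0), max_length)
    phase = math.pi * length / max_length
    return value_at_max + 0.5 * (value_at_zero - value_at_max) * (1.0 + math.cos(phase))
```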

OOD: Out-of-Distribution—data that differs significantly from the training data.

DeepSeek-R1: A specific family of reasoning-focused large language models.

Qwen2.5: A family of large language models developed by Alibaba Cloud.

GLM-4: A family of large language models developed by Tsinghua University / Zhipu AI.

TD-lambda: An algorithm generalizing n-step TD that averages n-step returns of all lengths, with weights decaying geometrically in lambda. It is commonly implemented with eligibility traces, which let credit propagate to earlier states faster.

Double newline delimiter: The separator used in this paper to define a single 'step' in the reasoning chain.
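
Splitting a response into steps under this convention can be sketched as follows. The function name is hypothetical, and dropping empty segments is an assumption.

```python
from typing import List


def split_into_steps(response: str) -> List[str]:
    """Split a reasoning chain into steps on blank lines (double newlines)."""
    return [part.strip() for part in response.split("\n\n") if part.strip()]
```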

Cross-Entropy Loss: A loss function used here to train the PRM by treating the clamped TD target as a soft label.