_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
TDRM: Temporal Difference Reward Modeling—the proposed method using TD learning to train smoother process reward models.
PRM: Process Reward Model—a model that assigns scores to intermediate steps of reasoning, not just the final answer.
ORM: Outcome Reward Model—a model that assigns scores based solely on the correctness of the final result.
TD learning: Temporal Difference learning—an RL method where value estimates are updated based on other value estimates (bootstrapping) rather than waiting for the final outcome.
n-step TD: A variant of TD learning that looks n steps into the future to update the current value estimate.
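The two entries above can be illustrated with a minimal sketch (not the paper's code; reward list, value list, and discount `gamma` are hypothetical):

```python
def n_step_td_target(rewards, values, t, n, gamma=1.0):
    """Bootstrapped n-step return for step t: the next n rewards,
    plus the discounted value estimate at step t+n (bootstrapping)."""
    target = 0.0
    for k in range(t, min(t + n, len(rewards))):
        target += (gamma ** (k - t)) * rewards[k]
    if t + n < len(values):  # bootstrap from a later value estimate
        target += (gamma ** n) * values[t + n]
    return target
```

With n = 1 this is classic one-step TD; as n grows past the episode length it reduces to the plain Monte Carlo return with no bootstrapping.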
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same prompt, removing the need for a separate value network.
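The group-relative advantage computation at the heart of GRPO can be sketched as follows (a simplified illustration, assuming one prompt and a list of scalar rewards for its sampled outputs):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of outputs for the same prompt,
    so no separate value network is required."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```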
RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness checks (e.g., math answers) as reward signals.
CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer.
Best-of-N: An inference strategy where N solutions are generated, and the one with the highest reward model score is selected.
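Best-of-N selection is just an argmax over sampled candidates; a minimal sketch (the `reward_model` scorer here is a hypothetical callable, not the paper's PRM):

```python
def best_of_n(candidates, reward_model):
    """Score each of the N sampled solutions and keep the highest-scoring one."""
    return max(candidates, key=reward_model)
```

For example, with string length standing in as a toy scorer, `best_of_n(["a", "bb", "ccc"], len)` returns `"ccc"`.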
Tree Search: An inference strategy (like beam search or lookahead search) that explores multiple reasoning paths and uses a reward model to prune or prioritize them.
Lipschitz constant: A measure of smoothness; a smaller constant implies the function (reward model) changes less abruptly between inputs.
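The standard definition (general notation, not the paper's): a function f is L-Lipschitz when

```latex
|f(x) - f(y)| \le L \, \lVert x - y \rVert \quad \text{for all } x, y,
```

so a smoother reward model corresponds to a smaller L, meaning nearby reasoning prefixes receive similar scores.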
Cosine Reward: A reward shaping function used in this paper that adjusts rewards based on correctness and step length, following a cosine curve.
OOD: Out-of-Distribution—data that differs significantly from the training data.
DeepSeek-R1: A specific family of reasoning-focused large language models.
Qwen2.5: A family of large language models developed by Alibaba Cloud.
GLM-4: A family of large language models developed by Tsinghua University / Zhipu AI.
TD-lambda: An algorithm generalizing n-step TD that uses eligibility traces to propagate the current TD error back to previously visited states, speeding up credit assignment.
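A minimal tabular sketch of the idea (illustrative only; step size `alpha`, discount `gamma`, and trace decay `lam` are hypothetical hyperparameters):

```python
def td_lambda_update(values, rewards, alpha=0.1, gamma=1.0, lam=0.9):
    """One pass over an episode: each step's TD error is broadcast to
    earlier states in proportion to their decaying eligibility traces."""
    traces = [0.0] * len(values)
    for t in range(len(rewards)):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # TD error at step t
        traces[t] += 1.0  # the visited state accumulates eligibility
        for s in range(len(values)):
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam  # credit decays for older states
    return values
```

With lam = 0 this collapses to one-step TD; with lam = 1 and gamma = 1 it approaches a Monte Carlo update.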
Double newline delimiter: The separator used in this paper to define a single 'step' in the reasoning chain.
Cross-Entropy Loss: A loss function used here to train the PRM by treating the clamped TD target as a soft label.
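The soft-label form can be sketched in one line (a generic binary cross-entropy, not the paper's exact loss code; `p` is the PRM's step score and `y` the clamped TD target in [0, 1]):

```python
import math

def soft_cross_entropy(p, y, eps=1e-12):
    """-(y*log(p) + (1-y)*log(1-p)); unlike hard 0/1 labels,
    y may be any value in [0, 1], so the target acts as a soft label."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
```

The loss is minimized when the predicted score p matches the soft target y exactly.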