Reasoning Trace: The intermediate natural language steps ('thoughts') an LLM generates to solve a problem before stating the final answer
Process Reward Model (PRM): An evaluator trained to assign scores to individual reasoning steps rather than just the final outcome
Outcome Reward Model (ORM): An evaluator that scores the entire reasoning trace based primarily on the correctness of the final answer
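The PRM/ORM contrast can be sketched in a few lines. The scorers below are toy rules standing in for trained neural evaluators (an assumption for illustration only); what matters is the shape of the output, one score per step versus one score per trace:

```python
from typing import List

def toy_prm_score(steps: List[str]) -> List[float]:
    # A PRM returns a score for EACH step. The rule here (steps containing a
    # number score higher) is a placeholder for a learned step evaluator.
    return [1.0 if any(c.isdigit() for c in s) else 0.5 for s in steps]

def toy_orm_score(final_answer: str, gold: str) -> float:
    # An ORM collapses the whole trace into ONE score, driven primarily by
    # whether the final answer is correct.
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

trace = ["2 + 3 = 5", "so the answer is five"]
print(toy_prm_score(trace))       # per-step scores, e.g. [1.0, 0.5]
print(toy_orm_score("5", "5"))    # single trace-level score, 1.0
```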
Meta-evaluation: The process of evaluating the evaluators themselves, often using datasets with human-annotated step-quality labels
Factuality: Whether a step is grounded in the query or reliable external knowledge
Validity: Whether a step follows logically from previous steps without errors (e.g., correct arithmetic or entailment)
Coherence: Whether a step's preconditions are satisfied by previous steps (e.g., not using unexplained numbers)
Utility: Whether a step actually contributes progress toward the correct final solution
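Of the four dimensions above, coherence is the easiest to make concrete: the "unexplained numbers" failure mode can be flagged with a simple pattern check. This is a toy sketch, not a real coherence evaluator; it only inspects the operands on the left of an `=` sign:

```python
import re
from typing import List, Set

NUM = r"\d+(?:\.\d+)?"

def unexplained_operands(step: str, context: List[str]) -> Set[str]:
    """Toy coherence check: numbers a step operates on that appear nowhere
    in the query or earlier steps (the 'unexplained numbers' failure mode).
    Only the left-hand side of '=' is checked, since the right-hand side is
    the step's own derived result."""
    seen: Set[str] = set()
    for prior in context:
        seen.update(re.findall(NUM, prior))
    used = set(re.findall(NUM, step.split("=")[0]))
    return used - seen

query = "Alice has 3 apples and buys 4 more."
print(unexplained_operands("3 + 4 = 7", [query]))                # set(): grounded
print(unexplained_operands("7 * 2 = 14", [query, "3 + 4 = 7"]))  # {'2'}: unexplained
```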
V-information: An information-theoretic metric measuring how much usable information a specific input (like a reasoning trace) gives a computationally bounded family of models V for predicting a target (like the final answer)
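An empirical estimate of V-information can be read off model log-probabilities: the drop in conditional entropy of the answer when the trace is supplied versus withheld. The log-probability lists below are hypothetical numbers, standing in for scores from a real model:

```python
import math
from typing import List

def v_information(logp_null: List[float], logp_cond: List[float]) -> float:
    """Empirical V-information (in nats) over a set of examples:
    H_V(answer | no trace) - H_V(answer | trace). Each list holds the model's
    log-probability of the gold answer per example, without/with the trace."""
    h_null = -sum(logp_null) / len(logp_null)
    h_cond = -sum(logp_cond) / len(logp_cond)
    return h_null - h_cond

# Toy numbers: the trace raises answer probability from 0.25 to 0.8,
# yielding ln(0.8 / 0.25) ≈ 1.16 nats of usable information.
print(v_information([math.log(0.25)] * 4, [math.log(0.8)] * 4))
```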
MCTS: Monte Carlo Tree Search—a search algorithm used to estimate the value (utility) of current states by simulating future paths
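The core of how MCTS estimates a state's value is Monte Carlo rollouts: simulate random continuations from the current state and average the resulting rewards. The sketch below shows just that rollout step on an invented toy task (extend a partial sum to hit a target), not the full select/expand/backpropagate loop:

```python
import random
from typing import List

def rollout_value(state: List[int], target: int, n_rollouts: int = 200,
                  seed: int = 0) -> float:
    """Estimate the utility of a partial solution by simulating random
    completions. Toy task (an assumption for illustration): keep adding
    numbers from {1, 2, 3} until the running sum reaches the target;
    reward 1 if the sum lands exactly on the target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rollouts):
        total = sum(state)
        while total < target:
            total += rng.choice([1, 2, 3])
        hits += (total == target)
    return hits / n_rollouts

# A state that already equals the target is certain; an overshoot is hopeless.
print(rollout_value([10], target=10))   # 1.0
print(rollout_value([11], target=10))   # 0.0
print(rollout_value([5], target=10))    # somewhere in between
```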
DPO: Direct Preference Optimization—a method for aligning models to preferences without explicit reward modeling
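DPO's per-pair objective is compact enough to write out directly: a logistic loss on the margin between policy-vs-reference log-ratios for the preferred and dispreferred responses. The numbers below are illustrative sequence log-probabilities, not real model outputs:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin is the policy's log-ratio gain on the chosen response minus
    its gain on the rejected one, both relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy equals the reference, the margin is 0 and the loss is ln 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
```

Driving the loss down pushes probability mass toward the chosen response and away from the rejected one, without ever fitting an explicit reward model.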
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on group-relative rewards
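The "group-relative" part of GRPO reduces to standardizing rewards within a group of responses sampled for the same prompt, so no learned value baseline is needed. A minimal sketch of that normalization (one common variant; implementations differ in details such as the std estimator):

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sampled response relative to its group:
    (r_i - mean) / std over the group's rewards. Responses beating the
    group average get positive advantage; the rest get negative."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four responses to one prompt, two correct (reward 1) and two wrong (reward 0):
print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))  # [-1.0, 1.0, 1.0, -1.0]
```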
Self-consistency: A decoding strategy where the model samples multiple reasoning paths and selects the final answer by majority vote across them
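The aggregation step of self-consistency is just a majority vote over the final answers extracted from each sampled path:

```python
from collections import Counter
from typing import List

def self_consistency(final_answers: List[str]) -> str:
    """Pick the most frequent final answer across sampled reasoning paths.
    Ties fall to the earliest-seen answer (an implementation choice here)."""
    return Counter(final_answers).most_common(1)[0][0]

# Final answers extracted from five sampled chains of thought:
print(self_consistency(["42", "41", "42", "42", "17"]))  # prints 42
```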