MRT: Meta Reinforcement Fine-Tuning—the proposed method that trains LLMs to minimize cumulative regret over reasoning episodes.
Cumulative Regret: The sum of differences between the optimal reward achievable by a budget-agnostic oracle and the actual reward obtained by the policy across episodes.
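As a concrete illustration, cumulative regret can be computed as a sum of per-episode gaps. This is a minimal sketch, not the paper's exact objective: `oracle_reward` stands in for the reward of the budget-agnostic oracle, and `episode_rewards` for the policy's reward after each episode.

```python
def cumulative_regret(oracle_reward, episode_rewards):
    """Sum over episodes of (oracle reward - reward the policy achieved).

    oracle_reward: reward of the budget-agnostic oracle (assumed constant here).
    episode_rewards: reward obtained by the policy after each episode.
    """
    return sum(oracle_reward - r for r in episode_rewards)

# A policy whose reward climbs toward the oracle's accrues less regret
# in later episodes than in early ones.
print(cumulative_regret(1.0, [0.2, 0.5, 0.9]))  # -> 1.4
```

Minimizing this quantity rewards making progress early rather than only succeeding at the final episode.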
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used as a baseline that optimizes policies based on group-relative rewards.
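The group-relative idea can be sketched as follows: sample a group of responses to the same prompt, then score each one against the group's own mean and standard deviation instead of a learned value function. This is an illustrative fragment of the advantage computation only, not a full GRPO implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled response, normalized within its group.

    rewards: scalar rewards for a group of responses to one prompt.
    eps: small constant to avoid division by zero when all rewards tie.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Correct responses (reward 1) get positive advantage, incorrect ones negative.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the baseline comes from the group itself, no separate critic network is needed.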
Meta-Prover Policy: A policy (denoted as μ) used to estimate the probability of success/reward at intermediate steps, effectively acting as a value estimator.
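One simple way to realize such a value estimate is a Monte-Carlo rollout: continue the partial reasoning trace several times and count how often the final answer is correct. The helpers `sample_completion` and `is_correct` below are hypothetical placeholders for the model's sampler and the answer checker.

```python
def estimate_success_prob(prefix, sample_completion, is_correct, n=8):
    """Estimate the probability of success from an intermediate step.

    prefix: partial reasoning trace so far.
    sample_completion: hypothetical function rolling out one completion.
    is_correct: hypothetical checker for the completed trace.
    n: number of rollouts; more rollouts lower the estimate's variance.
    """
    hits = sum(1 for _ in range(n) if is_correct(sample_completion(prefix)))
    return hits / n

# With a sampler that always completes correctly, the estimate is 1.0.
print(estimate_success_prob("2+2=", lambda p: p + "4",
                            lambda t: t.endswith("4"), n=4))
```

The resulting probabilities at successive steps can then serve as dense intermediate feedback.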
Episodes: Segments of the LLM's output stream (e.g., blocks between <think> tags or steps in a search tree) treated as individual attempts or reasoning steps.
Budget-Agnostic: A property of a policy that performs well under any test-time compute budget rather than being tuned for one fixed budget, with performance improving naturally as the budget grows.
STaR: Self-Taught Reasoner—an iterative fine-tuning method where a model generates reasoning traces, and correct ones are used for fine-tuning.
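One STaR round can be sketched as a generate-filter-fine-tune loop. The helpers `generate`, `check`, and `fine_tune` are hypothetical stand-ins for sampling a reasoning trace, verifying its final answer, and the supervised update step.

```python
def star_iteration(model, problems, generate, check, fine_tune):
    """One round of Self-Taught Reasoner training.

    Sample a trace per problem, keep only traces whose answers check out,
    then fine-tune the model on the kept (problem, trace) pairs.
    """
    kept = [(p, trace)
            for p in problems
            for trace in [generate(model, p)]
            if check(p, trace)]
    return fine_tune(model, kept)
```

Repeating this loop lets the model bootstrap from its own correct rationales.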
Warm Start: Initial supervised fine-tuning on high-quality data to stabilize the model before beginning reinforcement learning.
Dense Reward: A reward signal provided at every step or episode (intermediate feedback) rather than just at the very end (sparse outcome reward).
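The contrast with a sparse outcome reward can be made concrete. In this sketch, `episode_scores` is an assumed cumulative quality score after each episode (e.g. an estimated success probability); the dense variant shown here rewards the per-episode change in that score, which is one common way to densify an outcome signal.

```python
def sparse_rewards(episode_scores):
    """Outcome-only reward: zero everywhere except the final episode."""
    return [0.0] * (len(episode_scores) - 1) + [episode_scores[-1]]

def dense_rewards(episode_scores):
    """Progress-style dense reward: the change in score at each episode."""
    rewards, prev = [], 0.0
    for score in episode_scores:
        rewards.append(score - prev)
        prev = score
    return rewards

scores = [0.2, 0.5, 0.9]
print(sparse_rewards(scores))  # only the last episode is rewarded
print(dense_rewards(scores))   # every episode gets credit for its progress
```

Dense feedback of this kind gives the learner credit for intermediate progress instead of only for the final answer.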