TTS: Test-Time Scaling—methods that allocate additional computation during inference (test time) to improve model performance, often akin to 'thinking longer'
System 2: A cognitive science term describing slow, deliberate, and analytical thinking, which TTS aims to emulate in LLMs (in contrast to System 1's rapid, intuitive responses)
SFT: Supervised Fine-Tuning—training a model on labeled examples (e.g., long reasoning chains) to teach it structured reasoning patterns
RL: Reinforcement Learning—a training method where agents learn to make decisions by receiving rewards or penalties, used here to optimize reasoning policies
PRM: Process Reward Model—a verifier trained to evaluate the correctness of intermediate reasoning steps rather than just the final outcome
ORM: Outcome Reward Model—a verifier that scores the final answer of a reasoning chain
MCTS: Monte Carlo Tree Search—a search algorithm that balances exploration and exploitation to find optimal paths in decision trees, used to guide LLM reasoning steps
Internal Scaling: Training or prompting a model so that it autonomously determines its own reasoning length using capabilities encoded in its parameters (e.g., o1), rather than relying on external scaffolding
Parallel Scaling: Generating multiple candidate solutions in parallel and aggregating them (e.g., by majority voting) to increase the probability of obtaining a correct answer
Sequential Scaling: Iteratively refining or extending a single reasoning chain, where each step depends on previous steps
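As a minimal illustration of the parallel vs. sequential distinction above, the sketch below contrasts the two strategies. The `generate` and `refine` callables are hypothetical stand-ins for an LLM sampler and a refinement step; all names here are illustrative, not from the source.

```python
from collections import Counter

def parallel_scale(generate, prompt, n=8):
    """Parallel scaling: draw n candidate answers independently and
    return the majority-vote answer plus its vote share."""
    candidates = [generate(prompt) for _ in range(n)]
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / n

def sequential_scale(refine, draft, steps=3):
    """Sequential scaling: iteratively refine a single chain, where each
    pass conditions on the result of the previous one."""
    for _ in range(steps):
        draft = refine(draft)
    return draft

# Deterministic stub standing in for a stochastic LLM sampler (hypothetical).
samples = iter(["42", "41", "42", "42", "13", "42", "42", "41"])
def toy_generate(prompt):
    return next(samples)

best, conf = parallel_scale(toy_generate, "What is 6*7?", n=8)  # -> "42", 0.625
polished = sequential_scale(lambda d: d.strip().rstrip("."), " 42. ", steps=2)
```

Note that parallel scaling spends its budget on breadth (independent samples, aggregated afterward), while sequential scaling spends it on depth (each step depends on the previous output), which is why the two compose naturally with verifiers such as ORMs and PRMs.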