CoT: Chain-of-Thought—a prompting technique encouraging the model to generate intermediate reasoning steps
ToT: Tree-of-Thought—generalizes CoT by exploring multiple reasoning paths in a tree structure
MCTS: Monte Carlo Tree Search—a heuristic search algorithm balancing exploration and exploitation to find optimal decision paths
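The exploration/exploitation balance in MCTS is typically handled by the UCB1 selection rule. A minimal sketch (the function name and the default constant `c=1.41` are illustrative choices, not from the source):

```python
import math

def ucb1(child_value_sum: float, child_visits: int,
         parent_visits: int, c: float = 1.41) -> float:
    """UCB1 score: mean value (exploitation) plus an exploration bonus
    that shrinks as a child accumulates visits."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```

During the selection phase, MCTS descends the tree by repeatedly picking the child with the highest UCB1 score.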
RAG: Retrieval-Augmented Generation—enhancing model generation by retrieving relevant external documents
FLOPs: Floating Point Operations—a measure of computational cost
Self-Consistency: Generating multiple reasoning paths and selecting the answer via majority vote
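The majority-vote step of self-consistency is easy to sketch. Here `sample_answer` is a hypothetical stand-in for one sampled model run that returns a final answer string:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample several independent reasoning paths and return the answer
    that appears most often (majority vote over final answers)."""
    answers = [sample_answer() for _ in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

In practice the samples come from the same prompt decoded with a nonzero temperature, so the reasoning paths differ while correct paths tend to converge on the same answer.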
ReAct: Reason+Act—interleaving reasoning traces with actions (like tool calls) and observations
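The ReAct loop can be sketched as alternating model steps and tool observations. Everything here is a schematic assumption: `llm_step` stands in for a model call that returns a thought, an action, or a final answer, and `tools` maps tool names to callables:

```python
from typing import Callable, Dict, Optional

def react_loop(llm_step: Callable[[list], dict],
               tools: Dict[str, Callable[[str], str]],
               max_turns: int = 5) -> Optional[str]:
    """Interleave reasoning ("Thought"), tool calls ("Action"), and tool
    results ("Observation") until the model emits a final answer."""
    transcript: list = []
    for _ in range(max_turns):
        step = llm_step(transcript)  # hypothetical model call
        transcript.append(step)
        if step["type"] == "final":
            return step["content"]
        if step["type"] == "action":
            # Run the requested tool and feed the result back as an observation.
            obs = tools[step["tool"]](step["input"])
            transcript.append({"type": "observation", "content": obs})
    return None  # give up after max_turns
```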
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to tune models based on rewards
Beam Search: A search algorithm that keeps track of the top-k most probable sequences at each decoding step
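The top-k pruning that defines beam search can be sketched over an abstract next-token scorer. `next_token_logprobs` is a hypothetical function mapping a partial sequence to per-token log-probabilities:

```python
import math
from typing import Callable, Dict, List, Tuple

def beam_search(next_token_logprobs: Callable[[tuple], Dict[str, float]],
                beam_width: int = 3, max_len: int = 5) -> List[Tuple[tuple, float]]:
    """Keep the top-k most probable sequences at each decoding step."""
    # Each beam entry: (sequence of tokens, cumulative log-probability).
    beams: List[Tuple[tuple, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Prune: retain only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

With `beam_width=1` this reduces to greedy decoding; larger widths trade compute for a wider search over sequences.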
GRPO: Group Relative Policy Optimization—a PPO variant that drops the value network, instead estimating advantages by normalizing each reward against the other rewards in a group of responses sampled for the same prompt
RLVR: Reinforcement Learning with Verifiable Rewards—training models on rewards from automatic verifiers (e.g., answer checkers or unit tests) rather than learned reward models, encouraging reasoning chains that reach verifiably correct outcomes
Process Reward Model (PRM): A reward model that evaluates intermediate steps of reasoning rather than just the final outcome
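
A PRM's per-step scores are commonly aggregated to rank whole candidate chains. A minimal sketch, assuming min-aggregation (one common choice; the names `best_chain` and `score_step` are illustrative):

```python
from typing import Callable, List

def best_chain(chains: List[List[str]],
               score_step: Callable[[str], float]) -> List[str]:
    """Rank candidate reasoning chains by scoring each intermediate step
    with a process reward model and aggregating with min: a chain is
    only as strong as its weakest step."""
    def chain_score(chain: List[str]) -> float:
        return min(score_step(step) for step in chain)
    return max(chains, key=chain_score)
```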