PRM: Process Reward Model—a model that assigns a score to each intermediate step of a reasoning chain, rather than just the final answer
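To make the interface concrete, here is a minimal sketch of per-step scoring, with a toy heuristic standing in for a real trained PRM; the function names and the "ERROR" marker are invented for illustration, and min-over-steps is one common way to aggregate step scores into a chain score.

```python
def toy_prm_score(step: str) -> float:
    # Toy stand-in for a trained PRM: a real model would assign a learned
    # probability that this intermediate step is correct.
    return 0.1 if "ERROR" in step else 0.9

def score_chain(steps):
    """Score every intermediate step, then aggregate (min is one common choice)."""
    scores = [toy_prm_score(s) for s in steps]
    return scores, min(scores)

steps = ["Let x = 3", "Then 2x = 6", "ERROR: so x = 7"]
per_step, overall = score_chain(steps)
# per_step -> [0.9, 0.9, 0.1]; overall -> 0.1
```

The point of the interface is that a single bad step drags the chain score down, which an outcome-only reward (scoring just the final answer) cannot do.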
Reward Hacking: When an AI agent exploits flaws in the reward function to get a high score without actually achieving the intended goal (e.g., writing gibberish that looks like math)
Goodhart's Law: The economic principle that 'when a measure becomes a target, it ceases to be a good measure'—here, optimizing for PRM score degrades actual accuracy
Adversarial Tokens: Specific sequences of text (tokens) found via optimization that trick a model into outputting a high score or specific behavior
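A toy sketch of how such a search works, under stated assumptions: the "reward model" here is a deliberately flawed stand-in that rewards math-sounding words regardless of meaning (all names, the vocabulary, and the scorer are hypothetical), and the search is a simple greedy coordinate swap rather than a gradient-based method.

```python
import random

# Hypothetical flawed scorer: rewards superficially "mathy" tokens,
# regardless of whether the text means anything.
MATHY = {"therefore", "equals", "lemma", "qed", "="}

def toy_score(tokens):
    return sum(t in MATHY for t in tokens) / max(len(tokens), 1)

VOCAB = ["the", "cat", "therefore", "equals", "lemma", "qed", "runs", "="]

def greedy_suffix_search(base_tokens, suffix_len=4, iters=200, seed=0):
    """Coordinate-wise search: repeatedly pick one suffix position and swap in
    whichever vocabulary token most increases the score."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(iters):
        i = rng.randrange(suffix_len)
        best = max(VOCAB, key=lambda tok: toy_score(
            base_tokens + suffix[:i] + [tok] + suffix[i + 1:]))
        suffix[i] = best
    return suffix

suffix = greedy_suffix_search(["the", "cat", "runs"])
# The suffix fills up with "mathy" filler that raises the score
# without adding any actual reasoning.
```

Real attacks search a vocabulary of tens of thousands of tokens with gradient guidance, but the failure mode is the same: the optimized tokens raise the score, not the quality of the reasoning.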
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning capabilities
Fluency-Logic Dissociation: The phenomenon where a model can reliably distinguish good writing (fluency) from bad, yet fails to distinguish correct logic from incorrect logic
Entropy Regularization: A penalty added during optimization over a relaxed (soft) token distribution that discourages uncertainty, pushing the optimizer toward distinct, discrete tokens rather than vague mixtures of many words
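A minimal numerical sketch of the mechanism: a relaxed "token" is a softmax distribution over a toy vocabulary, and gradient descent on its entropy drives the distribution toward one-hot, i.e. toward a single discrete token. The gradient formula dH/dz_k = -p_k (log p_k + H) follows from differentiating the entropy of a softmax; all names here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

# Relaxed "token": a distribution over a toy 5-entry vocabulary,
# with a slight initial preference for entry 0.
z = np.array([0.3, 0.0, 0.0, 0.0, 0.0])
h0 = entropy(softmax(z))

lr = 1.0
for _ in range(500):
    p = softmax(z)
    h = entropy(p)
    grad = -p * (np.log(p) + h)  # dH/dz for p = softmax(z)
    z -= lr * grad               # descend on the entropy term

p = softmax(z)
# p is now close to one-hot: the relaxed token has collapsed
# onto a single discrete vocabulary entry.
```

In the actual adversarial-token setting this entropy term is one component of the loss (alongside the reward-maximizing term), so the optimized soft prompt can be mapped back to real, discrete tokens.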