SLMs: Small Language Models—models with roughly 10 billion parameters or fewer
Reasoning Trace: The step-by-step explanation a model generates before producing its final answer
PRM: Process Reward Model—a model trained to score the correctness of individual reasoning steps rather than just the final outcome
CoT: Chain-of-Thought—a prompting strategy encouraging models to generate intermediate reasoning steps
Best-of-N: A decoding strategy in which N candidate responses are sampled and a reward model selects the best one
LLM-as-a-judge: Using a strong language model (like GPT-4) to evaluate the quality or correctness of another model's output
Commonsense Reasoning: Tasks requiring general world knowledge and intuitive physics/social understanding, rather than specialized math or coding skills
Hallucination: Generated content that is factually incorrect, unverifiable, or not grounded in the provided context
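The Best-of-N strategy defined above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `generate` and `score` are hypothetical placeholders standing in for a language model sampler and a reward model, respectively.

```python
def generate(prompt: str, seed: int) -> str:
    # Placeholder sampler: returns one of several fake "candidate
    # responses" deterministically, for illustration only.
    candidates = ["4", "five", "The answer is 22."]
    return candidates[seed % len(candidates)]

def score(prompt: str, response: str) -> float:
    # Placeholder reward model: assigns a higher score to the
    # canonical answer. A real reward model would return a learned
    # scalar estimate of response quality.
    return 1.0 if response == "4" else 0.0

def best_of_n(prompt: str, n: int) -> str:
    # Sample n candidates, then keep the one the reward model
    # scores highest.
    samples = [generate(prompt, i) for i in range(n)]
    return max(samples, key=lambda s: score(prompt, s))

print(best_of_n("What is 2 + 2?", 3))  # -> 4
```

In practice, `generate` would draw independent samples at a nonzero temperature, and `score` would be a trained outcome or process reward model (see the PRM entry above).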