LLM-as-a-Judge: Using a language model to evaluate the quality of text generated by other models, often replacing human annotation
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs generated for the same input, often used to reduce variance without a separate critic model
CoT: Chain-of-Thought—a prompting technique that encourages models to generate intermediate reasoning steps before the final answer
DPO: Direct Preference Optimization—an offline method for aligning language models to preferences without explicit reward modeling
Verifiable Rewards: Rewards based on objectively checkable outcomes (e.g., correct math answer, correct preference prediction) rather than learned approximations
Position Bias: The tendency of LLM judges to favor a response based on where it appears in the prompt (e.g., consistently preferring the first option presented), regardless of actual quality
Pointwise Evaluation: Evaluating a single response in isolation to assign it a score
Pairwise Evaluation: Comparing two responses side-by-side to determine which is better
Synthetic Data: Training data generated by AI models rather than collected from humans
Distant Supervision: Training a model (here, a pointwise judge) using labels from a related task (pairwise preferences) rather than direct annotations
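To make the Pairwise Evaluation and Position Bias entries concrete, here is a minimal sketch of a pairwise judge with an order-swap consistency check: the judge is queried twice with the candidate order reversed, and a verdict counts only if both orderings agree. The `call_judge` function is a hypothetical stand-in for a real LLM API call; the toy length heuristic inside it exists only so the sketch runs.

```python
def call_judge(prompt: str, option_a: str, option_b: str) -> str:
    """Hypothetical judge call returning 'A' or 'B'.

    In practice this would prompt an LLM with the question and both
    responses; here a toy heuristic (prefer the longer response) stands in.
    """
    return "A" if len(option_a) >= len(option_b) else "B"


def pairwise_judge(prompt: str, r1: str, r2: str) -> str:
    """Pairwise evaluation with a position-bias check via order swapping."""
    first = call_judge(prompt, r1, r2)   # r1 shown as option A
    second = call_judge(prompt, r2, r1)  # swapped: r2 shown as option A
    if first == "A" and second == "B":   # both orderings pick r1
        return "r1"
    if first == "B" and second == "A":   # both orderings pick r2
        return "r2"
    # The verdict flipped with presentation order: a sign of position
    # bias, so the comparison is treated as a tie.
    return "tie"
```

Verdicts that survive the swap are more trustworthy than single-pass judgments, which is why order-swapped (or order-randomized) evaluation is a common mitigation for position bias.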