LLM-as-a-Judge: Using a large language model to evaluate the quality of text generated by other models, serving as a scalable alternative to human annotators
SFT Warm-Up: Supervised Fine-Tuning phase where the model learns the format and reasoning style of a judge from high-quality demonstrations
DPO: Direct Preference Optimization—a stable method for aligning language models to preferences without training a separate reward model
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
NLL loss: Negative Log-Likelihood loss, added as a regularization term during DPO to maintain generation quality
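The interaction between the DPO objective and the NLL regularizer can be sketched numerically. This is a minimal illustration, not the source's implementation: the coefficient values (`beta`, `nll_coef`) and the choice to apply the NLL term only to the chosen response are assumptions for the sake of the example.

```python
import math


def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, nll_coef=0.2):
    """DPO preference loss plus an NLL regularizer on the chosen response.

    logp_w / logp_l      : summed log-probabilities of the chosen (w) and
                           rejected (l) responses under the policy model
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                           reference model
    beta, nll_coef       : illustrative hyperparameter values (assumed)
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # DPO term: -log(sigmoid(margin)), small when the margin is large.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL term: keeps the chosen response likely, preserving fluency.
    nll = -logp_w
    return dpo + nll_coef * nll
```

With a zero margin the DPO term reduces to `log 2`, and improving the chosen response's log-probability lowers the total loss through both terms.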
RewardBench: A benchmark dataset designed to evaluate reward models and judge models on their ability to correctly identify preferred responses
Position Bias: The tendency of a judge model to systematically favor a response based on its position (first or second) rather than its content; mitigated here by querying the judge with both response orders
Length Bias: The tendency of a judge model to prefer longer responses regardless of quality
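The order-swap mitigation mentioned under Position Bias can be sketched as follows. This is a hypothetical harness, not the source's code: `judge` stands in for any callable that returns `"first"` or `"second"`, and the tie-on-disagreement rule is one common convention.

```python
def judge_with_swap(judge, prompt, resp_a, resp_b):
    """Query the judge twice with the response order swapped.

    The verdict is kept only if it is consistent across both orderings;
    otherwise the comparison is treated as a tie, which filters out
    position-biased judgments.
    """
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    win1 = "A" if v1 == "first" else "B"
    win2 = "B" if v2 == "first" else "A"
    return win1 if win1 == win2 else "tie"
```

A judge that always picks whichever response appears first is neutralized to a tie, while a judge that consistently prefers the same response regardless of position keeps its verdict.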