DPO: Direct Preference Optimization—a method to align language models with preferences by optimizing a classification loss on preference pairs rather than training a reward model.
SFT: Supervised Fine-Tuning—training a model on a labeled dataset of high-quality instruction-response pairs.
HIPO: Hard Sample-aware Iterative Direct Preference Optimization—the authors' proposed method that iteratively selects difficult negative samples for DPO training.
NHSR: Non-Hallucinated Statute Rate—a metric measuring the proportion of cited statutes that are entirely accurate in name, number, and content.
BERTScore: A metric for evaluating generated text by matching candidate and reference tokens via the cosine similarity of their contextual embeddings.
Behavior Cloning: In this context, the initial Supervised Fine-Tuning (SFT) stage, in which the model learns to mimic the provided high-quality legal answers.
NLL loss: Negative Log-Likelihood loss—the standard loss function used to train language models to predict the next token.
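
To make the DPO and NLL loss entries concrete, here is a minimal sketch of both losses computed from log-probabilities. It follows the standard DPO formulation for a single preference pair and is not necessarily the authors' implementation. The function names, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def sequence_nll(token_logprobs):
    """NLL of one sequence: the negative sum of its per-token log-probabilities."""
    return -sum(token_logprobs)


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the total log-probability of a full response under the
    policy being trained or under the frozen reference model.
    beta=0.1 is an assumed illustrative value, not taken from the source.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference.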