DPO: Direct Preference Optimization—a method to align language models by optimizing the policy to prefer winning completions over losing ones without a separate reward model
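The DPO objective can be sketched for a single preference pair. This is a minimal illustration (not the full training loop), assuming scalar sequence log-probabilities under the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).
    The margin compares how much more the policy prefers the winning
    completion over the losing one, relative to the reference model.
    (Toy sketch; names and beta value are illustrative assumptions.)"""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks as the policy's preference for the winner grows
# beyond the reference model's preference.
loss_neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # margin 0 -> log 2
loss_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # positive margin
```

Note that only log-probability gaps matter, which is why no separate reward model is needed.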
steering vector: A vector in the model's activation space that, when added to hidden states, biases the output toward specific behaviors (e.g., safety)
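Mechanically, applying a steering vector is just a broadcast addition in activation space. A minimal sketch with toy dimensions (the shapes, function name, and scaling factor `alpha` are illustrative assumptions):

```python
import numpy as np

def apply_steering(hidden, steer, alpha=1.0):
    """Add a scaled steering vector to every token's hidden state.
    hidden: (seq_len, d_model); steer: (d_model,). Broadcasting adds
    the same direction at each position."""
    return hidden + alpha * steer

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))   # toy (seq_len, d_model)
steer = rng.standard_normal(8)
steered = apply_steering(hidden, steer, alpha=2.0)
```

In practice the addition is done via a forward hook at a chosen layer; the sketch shows only the arithmetic.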
logit: A raw, unnormalized score output by the final layer of a neural network, before the softmax function converts these scores into probabilities
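The softmax step that turns logits into probabilities can be written in a few lines (a standard numerically stable formulation; the example values are arbitrary):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: shift by the max logit before
    exponentiating, then normalize to a probability distribution."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1; largest logit wins
```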
activation space: The high-dimensional vector space where a model's intermediate representations (hidden states) reside
low-rank: A property of a matrix or update whose data lies in a subspace of much lower dimension than the full space; in this context, it implies that alignment changes the model along only a few directions
G-Eval: An evaluation framework that uses strong LLMs (like GPT-4) to grade the quality of text generated by other models
BLEU: Bilingual Evaluation Understudy—a metric for evaluating text quality by counting matching n-grams between a candidate and reference text
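The n-gram matching at BLEU's core can be illustrated with its simplest component, clipped unigram precision (full BLEU combines clipped 1- to 4-gram precisions with a brevity penalty; this sketch and its function name are illustrative):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: each candidate word counts only up to
    the number of times it appears in the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(1, sum(cand.values()))

p = unigram_precision("the cat sat on the mat",
                      "the cat is on the mat")  # 5 of 6 words match
```

Clipping prevents a degenerate candidate like "the the the" from scoring perfectly.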
ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (the -L variant)—a metric that scores text by the longest common subsequence between candidate and reference text
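The longest-common-subsequence computation behind ROUGE-L is a classic dynamic program; a minimal sketch, with an F1 combination of LCS-based precision and recall (function names are illustrative):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    prec, rec = l / len(c), l / len(r)
    return 2 * prec * rec / (prec + rec)
```

Because a subsequence need not be contiguous, ROUGE-L rewards in-order overlap even when words are interleaved.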
spectral collapse: A phenomenon where the singular values of a matrix drop off sharply, indicating the data has lower effective dimensionality (rank)
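Spectral collapse is easy to see numerically: build a matrix that is nearly rank-2 and inspect its singular values (the sizes and noise scale below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# A 64x64 matrix that is nearly rank-2: two outer products plus tiny noise.
u1, v1 = rng.standard_normal(64), rng.standard_normal(64)
u2, v2 = rng.standard_normal(64), rng.standard_normal(64)
M = np.outer(u1, v1) + np.outer(u2, v2) + 1e-6 * rng.standard_normal((64, 64))

s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
# Spectral collapse: the top two singular values dwarf the rest,
# so the effective rank is ~2 despite the 64x64 shape.
ratio = s[2] / s[0]
```

The same diagnostic applied to a weight update (e.g., the difference between aligned and base weights) reveals whether alignment is effectively low-rank.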