DPO: Direct Preference Optimization—an alignment algorithm that fine-tunes models on preference pairs without an explicit reward model
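RLHF: Reinforcement Learning from Human Feedback, a method that aligns models by optimizing them with reinforcement learning against a reward signal derived from human preference judgments
MLP: Multilayer Perceptron, the feed-forward sub-layer within each Transformer layer, where much of the model's knowledge is thought to be stored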
SVD: Singular Value Decomposition—a linear algebra technique to factorize a matrix into singular vectors and values, used here to isolate dimensions of toxicity
residual stream: The running hidden-state vector in a Transformer that each attention and MLP layer reads from and adds its output to
key-value vectors: Components of the MLP layer; key vectors detect patterns in the input, and value vectors add information to the residual stream
PPO: Proximal Policy Optimization—a standard RLHF algorithm that uses a reward model (contrasted with DPO here)
Jigsaw: A dataset of toxic and non-toxic comments used for training toxicity classifiers
RealToxicityPrompts: A dataset of naturally occurring web-text prompts used to measure how readily language models produce toxic continuations
PPLM: Plug and Play Language Models, a controllable text generation method, used here to generate synthetic toxic/non-toxic pairs for DPO training
linear probe: A simple linear classifier trained on internal representations to detect specific properties (like toxicity)