_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that trains small low-rank matrices added to frozen pretrained weights instead of updating all weights
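A minimal sketch of the low-rank update LoRA learns, with illustrative shapes and rank (not values from the paper):

```python
import numpy as np

# Instead of updating the full weight W (d_out x d_in), LoRA trains two small
# matrices B (d_out x r) and A (r x d_in) with rank r << min(d_out, d_in).
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable low-rank factor
B = np.zeros((d_out, r))                 # zero init, so W_eff == W before training

W_eff = W + B @ A                        # effective weight in the forward pass

# Trainable parameters drop from d_out*d_in to r*(d_out + d_in).
full_params = d_out * d_in               # 4096
lora_params = r * (d_out + d_in)         # 512
```

With these toy shapes, LoRA trains 512 parameters instead of 4096, an 8x reduction that grows with matrix size.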
SVD: Singular Value Decomposition—a mathematical method to factorize a matrix, used here to identify principal components (neurons) that contribute most to model behavior
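A hedged illustration of using SVD to rank a weight matrix's components by importance; the exact neuron-selection criterion here is an assumption for illustration, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))          # a toy weight matrix

# Factorize W = U @ diag(S) @ Vt; singular values S come back sorted descending.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# The top-k components capture the largest share of the matrix's "energy"
# (its squared Frobenius norm), so they are natural principal components.
k = 2
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]     # best rank-k approximation of W

energy_ratio = np.sum(S[:k] ** 2) / np.sum(S ** 2)
```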
ASR: Attack Success Rate—the percentage of harmful prompts for which the model generates a harmful response
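ASR can be sketched as a simple fraction; the `is_harmful` judge below is a hypothetical stand-in for whatever safety classifier or human annotation an evaluation would actually use:

```python
def attack_success_rate(responses, is_harmful):
    """Percentage of responses to harmful prompts judged harmful."""
    flags = [is_harmful(r) for r in responses]
    return 100.0 * sum(flags) / len(flags)

# Toy usage with a hypothetical keyword-based judge (illustration only).
responses = [
    "Sure, here is how...",
    "I can't help with that.",
    "Sure, step 1...",
]
asr = attack_success_rate(responses, lambda r: r.startswith("Sure"))
# Two of three responses are flagged, so ASR is roughly 66.7.
```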
MMLU: Massive Multitask Language Understanding—a benchmark evaluating models on a wide range of subjects to measure general capability
Safety-Critical Neurons: Specific neurons (parameters) within the model that are highly active or essential for processing safety-related concepts and refusals
Model Extrapolation: A technique to enhance model features (like safety) by linearly extending the weight difference between a weak and a strong model (e.g., base and aligned model) beyond the strong model
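The extrapolation idea can be written in one line of weight arithmetic; the variable names and the value of alpha below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
w_base = rng.standard_normal(10)                     # weak/base model weights
w_aligned = w_base + 0.1 * rng.standard_normal(10)   # aligned model weights

# alpha = 1 recovers the aligned model; alpha > 1 pushes further along the
# base -> aligned direction, amplifying the alignment shift.
alpha = 1.5
w_extra = w_base + alpha * (w_aligned - w_base)
# Equivalent form: w_aligned + (alpha - 1) * (w_aligned - w_base)
```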
Frobenius Norm: A matrix norm defined as the square root of the sum of the absolute squares of its elements, used here to measure similarity between weight matrices
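The definition translates directly into code; comparing two weight matrices then amounts to taking the norm of their difference:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])

# Square root of the sum of squared entries: sqrt(9 + 16) = 5.
fro = np.sqrt(np.sum(A ** 2))

# Similarity between two weight matrices: norm of their difference
# (smaller means more similar).
B = A + 0.01
diff_norm = np.linalg.norm(A - B, "fro")
```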
BeaverTails: A dataset containing safety-related prompts and responses (both safe and unsafe) used for training and evaluation
Alpaca: A dataset of instruction-following examples used for benign task fine-tuning