_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
SAE: Sparse Autoencoder—a neural network trained to decompose dense model activations into sparse, interpretable features
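A minimal sketch of an SAE forward pass may help fix the idea: a linear encoder expands a dense activation into a wide, non-negative feature vector, and a linear decoder reconstructs the input from it. The sizes and random weights below are purely illustrative, not from any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # illustrative sizes; real SAEs are far wider than the model dimension

# Hypothetical untrained weights: encoder expands, decoder projects back
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)

def sae_forward(x):
    """Encode a dense activation into non-negative features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps feature activations non-negative
    x_hat = f @ W_dec                       # linear decoder reconstructs the input
    return f, x_hat

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
print(f"{int((f > 0).sum())} of {d_sae} features active")
```

Training adds a reconstruction loss plus a sparsity mechanism (e.g. an L1 penalty, or the TopK/JumpReLU activations defined below) so that only a few features fire per input.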
IFT: Instruction Fine-Tuning—the process of training pre-trained Large Language Models (LLMs) on datasets of instruction–response pairs so that they learn to follow user commands
Residual Stream: The primary vector pathway through a Transformer, which each attention and feed-forward layer reads from and adds its output back into
IFEval: Instruction Following Evaluation—a benchmark that measures a model's ability to follow verifiable constraints in instructions (e.g., 'no capitalization')
AlpacaEval 2.0: A benchmark using an LLM-based judge to compare model outputs against a reference model (usually GPT-4) on real-world user instructions
TopK Activation: An activation function that keeps only the K largest values in a vector and sets the rest to zero, enforcing sparsity
JumpReLU: An activation function that zeroes out values below a threshold and passes values above it linearly, used here to rectify SAE activations
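The two sparsity-enforcing activations above can be contrasted with a tiny numeric sketch (the vector and threshold are made-up values for illustration): TopK fixes the *number* of surviving entries, while JumpReLU fixes a *threshold* they must clear.

```python
import numpy as np

def topk(z, k):
    """TopK activation: keep the k largest entries of z, zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]  # indices of the k largest values
    out[idx] = z[idx]
    return out

def jumprelu(z, theta):
    """JumpReLU: zero values below threshold theta, pass values >= theta linearly."""
    return np.where(z >= theta, z, 0.0)

z = np.array([0.1, 2.0, -0.5, 1.2, 0.7])
print(topk(z, 2))        # only the two largest entries survive: 2.0 and 1.2
print(jumprelu(z, 1.0))  # only entries >= 1.0 survive: 2.0 and 1.2
```

On this input the two agree, but in general TopK guarantees exactly k active features per input, whereas JumpReLU lets the number of active features vary with the input (in trained SAEs the threshold is typically learned per feature).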
Monosemanticity: The property of a neuron or feature responding to exactly one specific concept (e.g., a specific syntax or topic) rather than multiple unrelated concepts