MIB: Mechanistic Interpretability Benchmark—the underlying framework used for this shared task
IOI: Indirect Object Identification, a task where the model must predict the indirect object of a sentence (e.g., completing 'When Mary and John went to the store, John gave an apple to' with 'Mary')
CPR: Integrated Circuit Performance Ratio, a metric measuring whether a circuit includes the components that positively affect task performance (higher is better)
CMD: Integrated Circuit-Model Distance, a metric measuring whether a circuit yields the same strength of preference as the full model (0 is best)
DAS: Distributed Alignment Search—a method for finding linear subspaces in model activations that correspond to high-level causal variables
SAE: Sparse Autoencoder—an unsupervised method for decomposing model activations into sparse, interpretable features
EAP: Edge Attribution Patching—an efficient approximation of activation patching to estimate the causal importance of edges
IG: Integrated Gradients—an attribution method that aggregates gradients along a path from a baseline to the input to assign importance scores
RAVEL: Resolving Attribute-Value Entanglements in LMs—a dataset evaluating methods for isolating specific attributes of entities
Activation Patching: A technique where a model's internal activation is replaced with an activation from a different input to test its causal role
Faithfulness: A measure of how accurately a circuit or causal variable replicates the full model's behavior under intervention
ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions used to evaluate knowledge and reasoning
MCQA: Multiple-Choice Question Answering—a task format where models select the correct answer from a list of options