CSP: Constraint Satisfaction Problem; a mathematical framework in which the goal is to find an assignment of values that satisfies a given set of constraints
SAT Probe: The proposed method (Satisfaction Probe), which applies a classifier to attention weights to predict whether a constraint is satisfied
Constraint Tokens: Tokens in the prompt representing the specific entity or condition the model must adhere to (e.g., the name 'Steven Spielberg' in a query about his movies)
Mechanistic Interpretability: A field of AI research focused on reverse-engineering the internal components (neurons, layers, attention heads) of neural networks to understand how they implement specific behaviors
Lasso Regression: A linear regression method with an L1 penalty on the coefficients, which drives some of them to exactly zero and so performs variable selection and regularization together, improving prediction accuracy and interpretability
WikiData: A collaborative, multilingual knowledge graph hosted by the Wikimedia Foundation
Spearman's Correlation: A statistical measure of the strength and direction of a monotonic relationship between two ranked variables