_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
_example: {'RAG': 'Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents', 'F1 score': 'A metric balancing precision (are answers correct?) and recall (are answers complete?)', 'PPO': 'Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective', 'parameter sharing': 'Multiple agents use the same underlying model weights, reducing memory and enabling coordination', 'warm start': 'Pre-training each module on labeled examples before switching to reinforcement learning, so agents start from a competent baseline'}
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
Soft thought tokens: Continuous vector representations (hidden states) used as reasoning steps, rather than discrete words from a vocabulary
Projection module: A trainable neural network layer that maps embeddings from one model's vector space to another model's vector space
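As a toy illustration of the definition above (dimensions and weights are made up, not taken from any specific paper), a projection module is just a trainable affine map between two embedding spaces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a smaller model's 768-dim hidden states are
# mapped into a larger model's 4096-dim embedding space.
d_src, d_tgt = 768, 4096

# The projection module here is a single linear layer: W x + b.
# In practice W and b would be learned; random/zero values stand in.
W = rng.normal(scale=0.02, size=(d_tgt, d_src))
b = np.zeros(d_tgt)

def project(h):
    """Map one source-space hidden state into the target embedding space."""
    return W @ h + b

h_src = rng.normal(size=d_src)   # e.g. one soft thought token
h_tgt = project(h_src)
print(h_tgt.shape)               # (4096,)
```

Real implementations sometimes use a small multi-layer perceptron instead of a single linear layer, but the role is the same: bridging two models' vector spaces.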
Catastrophic forgetting: The phenomenon where a machine learning model loses previously learned knowledge or capabilities when fine-tuned on new data
Hard-CoT: Traditional Chain-of-Thought reasoning where intermediate steps are generated as discrete, human-readable text tokens
Soft-CoT: Reasoning approaches where intermediate steps are represented as continuous vectors (latent states) typically not human-readable
Coconut: Chain of Continuous Thought—a prior method that trains the LLM to reason in continuous space, often requiring full fine-tuning
LLM: Large Language Model—a neural network trained on large text corpora to predict and generate text
NLL: Negative Log-Likelihood—a loss function used to train language models by maximizing the probability of the correct next token
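A minimal worked example of the NLL loss, using a made-up three-token vocabulary and arbitrary logits (raw, pre-softmax scores):

```python
import numpy as np

# Toy next-token prediction: three candidate tokens, made-up logits.
logits = np.array([2.0, 1.0, 0.1])
target = 0  # index of the correct next token

# Softmax turns logits into a probability distribution
# (subtracting the max is a standard numerical-stability trick).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# NLL is large when the model assigns low probability to the target,
# so minimizing it maximizes the probability of the correct token.
nll = -np.log(probs[target])
print(round(nll, 3))  # 0.417
```

Averaging this quantity over every token position in a training corpus gives the standard language-modeling loss.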
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains only small low-rank matrices added to selected layers
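A sketch of the core LoRA idea (dimensions and rank chosen for illustration only): the frozen weight matrix W is augmented with a trainable low-rank update B A, so only a tiny fraction of parameters is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen pre-trained weight matrix (random placeholder values).
d_out, d_in = 4096, 768
W = rng.normal(scale=0.02, size=(d_out, d_in))

# LoRA trains only two low-rank factors with rank r much smaller than d.
r, alpha = 8, 16
A = rng.normal(scale=0.02, size=(r, d_in))  # trainable
B = np.zeros((d_out, r))                    # trainable, zero-initialized

def lora_forward(x):
    """Frozen path W x plus the scaled low-rank update (alpha/r) B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size           # 3,145,728 entries in the full matrix
lora_params = A.size + B.size  # 38,912 — about 1.2% of the full matrix
print(full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen base model, and fine-tuning only nudges it through the low-rank path.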
Hidden states: The internal vector representations of data within a neural network layer