LoRA: Low-Rank Adaptation—a technique to fine-tune large models by injecting small, trainable rank-decomposition matrices while freezing the pre-trained weights.
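A minimal NumPy sketch of the LoRA idea (the layer shapes and rank here are illustrative assumptions, not values from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 8, 8, 2            # layer dims and LoRA rank (illustrative)
W = rng.normal(size=(d, k))  # frozen pre-trained weight: never updated

# Trainable low-rank factors. B starts at zero, so the adapted
# layer initially behaves exactly like the frozen one.
A = rng.normal(size=(r, k)) * 0.01
B = np.zeros((d, r))

def adapted_forward(x):
    # Effective weight is W + B @ A; only A and B receive gradients.
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))
# With B = 0, the adapted output equals the frozen output.
assert np.allclose(adapted_forward(x), x @ W.T)
```

Only `d*r + r*k` parameters are trained instead of `d*k`, which is where the memory savings come from.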
LRLs: Low-Resource Languages—languages with limited available digital data for training NLP models (e.g., Swahili).
Contrastive Learning: A learning paradigm that aims to pull similar items (positive pairs) close together in embedding space while pushing dissimilar items (negative pairs) apart.
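One common instantiation of this objective is the InfoNCE loss, sketched below with toy 2-D vectors (the vectors and temperature are illustrative assumptions):

```python
import math

def cos(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    # Softmax cross-entropy over similarities: the positive pair
    # should out-score every negative pair.
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

anchor        = [1.0, 0.0]
good_positive = [0.9, 0.1]    # nearly aligned with the anchor
bad_positive  = [0.0, 1.0]    # orthogonal to the anchor
negatives     = [[-1.0, 0.0], [0.0, -1.0]]

# The loss is lower when the positive pair is actually similar.
assert info_nce(anchor, good_positive, negatives) < info_nce(anchor, bad_positive, negatives)
```

Minimizing this loss pulls the positive pair together and pushes the negatives apart, which is exactly the behavior the definition describes.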
Cosine Similarity: The cosine of the angle between two vectors, used to measure their directional similarity; it ranges from -1 (opposite directions) to 1 (same direction), with 0 meaning orthogonal.
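The definition can be computed directly from the dot product and vector norms (a pure-Python sketch):

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

assert cosine_similarity([1, 0], [2, 0]) == 1.0    # same direction
assert cosine_similarity([1, 0], [-1, 0]) == -1.0  # opposite direction
assert cosine_similarity([1, 0], [0, 1]) == 0.0    # orthogonal
```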
Triplet Margin Loss: A loss function where the model learns to keep a positive example closer to an anchor than a negative example by at least a specified margin.
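A minimal sketch of the loss using Euclidean distance (the toy points and margin are illustrative assumptions):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Penalize cases where the positive is not at least `margin`
    # closer to the anchor than the negative is.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # close to the anchor
negative = [3.0, 0.0]   # far from the anchor

# d(a, p) = 0.1, d(a, n) = 3.0, so 0.1 - 3.0 + 1.0 < 0 → loss is 0.
assert triplet_margin_loss(anchor, positive, negative) == 0.0
```

A loss of zero means the triplet already satisfies the margin; only violating triplets contribute gradient.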
L2-normalization: Scaling a vector so that its length (Euclidean norm) is 1, ensuring similarity comparisons depend on direction rather than magnitude.
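This is a one-line operation (a pure-Python sketch; dividing by a zero norm is left unhandled for brevity):

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

v = [3.0, 4.0]        # Euclidean norm is 5
u = l2_normalize(v)   # scaled to unit length
assert math.isclose(math.sqrt(sum(x * x for x in u)), 1.0)
```

After normalization, the dot product of two vectors equals their cosine similarity.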
MWEs: Multi-Word Expressions—terms composed of multiple words that function as a single unit of meaning (e.g., "kick the bucket").