_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
RoPE: Rotary Position Embeddings—a method for encoding positional information in transformers by rotating query and key vectors through position-dependent angles, so that attention scores depend on the relative distance between tokens
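A minimal pure-Python sketch of the rotation (illustrative only — real implementations operate on batched tensors with fused kernels; the function name `apply_rope` and the list-based representation are assumptions for clarity):

```python
import math

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of x by position-dependent angles.

    x: flat list of floats with even length (a query or key vector).
    pos: integer token position.
    Each pair i is rotated by theta = pos * base**(-i / d), giving lower
    frequencies to higher dimensions.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out

# At position 0 every rotation angle is zero, so the vector is unchanged.
print(apply_rope([1.0, 0.0, 1.0, 0.0], pos=0))  # → [1.0, 0.0, 1.0, 0.0]
```

The key property is that the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n, which is what makes the encoding "relative".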
GQA: Grouped-Query Attention—an efficiency technique where multiple query heads share a single key-value head, shrinking the key-value cache and its memory usage
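The head-sharing scheme reduces to a simple index mapping. A toy sketch (head counts are illustrative, not the model's actual configuration):

```python
# 8 query heads sharing 2 key-value heads: each KV head serves a
# contiguous group of 4 query heads, so the KV cache is 4x smaller
# than with standard multi-head attention.
n_q_heads, n_kv_heads = 8, 2
group_size = n_q_heads // n_kv_heads

# Which KV head each query head attends with.
kv_head_for_q = [q // group_size for q in range(n_q_heads)]
print(kv_head_for_q)  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Multi-head attention is the special case n_kv_heads == n_q_heads; multi-query attention is n_kv_heads == 1.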
SwiGLU: Swish-Gated Linear Unit—a gated activation combining the Swish (SiLU) function with a Gated Linear Unit, used in transformer feed-forward layers in place of plain ReLU/GELU activations
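An elementwise sketch of the gate (assumed shapes: `gate` and `up` stand in for the two linear projections x·W and x·V that a real feed-forward layer would compute):

```python
import math

def swish(x):
    """Swish / SiLU with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    """Elementwise SwiGLU: Swish(gate) * up.

    In a transformer feed-forward layer, gate and up are two separate
    linear projections of the same input; the gated result is then
    projected back down by a third matrix.
    """
    return [swish(g) * u for g, u in zip(gate, up)]

# A zero gate closes its channel regardless of the up projection.
print(swiglu([0.0, 2.0], [5.0, 1.0]))
```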
BPE: Byte Pair Encoding—a tokenization algorithm that iteratively merges the most frequent pair of bytes or characters
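One merge step of the iterative algorithm can be sketched as (toy symbol-level version; production tokenizers work on bytes and apply a learned merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent pair of adjacent symbols."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with a merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("aababab")
pair = most_frequent_pair(tokens)   # ('a', 'b') occurs three times
print(merge_pair(tokens, pair))     # → ['a', 'ab', 'ab', 'ab']
```

Training repeats these two steps until a target vocabulary size is reached; each merged pair becomes a new vocabulary entry.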
Curriculum Learning: A training strategy where the difficulty or distribution of training data is meaningfully ordered or scheduled over time
Upsampling: Artificially increasing the frequency of data from underrepresented classes (here, low-resource languages) during training
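One common way to implement this is temperature-based sampling over per-language corpus sizes. The sketch below is a generic illustration of that technique, not necessarily the exact scheme used for this model:

```python
def sampling_weights(sizes, temperature=0.7):
    """Temperature-scaled sampling probabilities: p_i ∝ size_i ** T.

    sizes: dict mapping language -> corpus size.
    T = 1 reproduces the raw data proportions; T < 1 flattens the
    distribution, upsampling low-resource languages relative to
    their raw share.
    """
    scaled = [s ** temperature for s in sizes.values()]
    total = sum(scaled)
    return {lang: w / total for lang, w in zip(sizes, scaled)}

# With T < 1, the small corpus gets a larger share than its raw 1/101.
print(sampling_weights({"en": 100.0, "bs": 1.0}, temperature=0.5))
```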
COMET: A neural framework for training machine translation evaluation models that correlate well with human judgments of translation quality
RMSNorm: Root Mean Square Normalization—a normalization technique that re-scales inputs based on their root mean square, simpler than LayerNorm
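A pure-Python sketch of the normalization (real implementations use a learned per-dimension gain and operate on tensors; `gain` defaulting to ones is an assumption here):

```python
import math

def rmsnorm(x, gain=None, eps=1e-6):
    """RMSNorm: x / RMS(x), scaled by an optional learned gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term,
    making it cheaper while working comparably well in practice.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    g = gain if gain is not None else [1.0] * len(x)
    return [gi * v / rms for gi, v in zip(g, x)]

# [3, 4] has RMS sqrt(12.5) ≈ 3.5355; the output has RMS ≈ 1.
print(rmsnorm([3.0, 4.0]))
```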
Onion: ONe Instance ONly—a deduplication tool that removes documents containing high ratios of duplicate n-grams
FlashAttention: An algorithm that speeds up attention computation and reduces memory usage (implied by the Llama 3 architecture context, though the specific kernel is not detailed)
Focus languages: The 17 specific European languages (e.g., Bosnian, Estonian, Ukrainian) targeted for equitable performance in this model