MiLe Loss: The proposed loss function, which uses Mutual Information Learning principles (entropy-based) to scale gradients
Focal Loss: A loss function designed for class imbalance that down-weights easy examples to focus on hard ones
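The standard Focal Loss formulation is FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t is the model's probability for the true class. A minimal sketch of the per-example term (function name and γ default are illustrative):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one example, given the probability p_t
    assigned to the true class.

    The (1 - p_t)**gamma factor shrinks the loss for easy examples
    (p_t near 1), so training focuses on hard ones (p_t near 0).
    With gamma = 0 this reduces to ordinary cross-entropy, -log(p_t).
    """
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy example contributes far less loss than a hard one:
easy = focal_loss(0.9)  # small
hard = focal_loss(0.1)  # large
```

Setting `gamma=0` recovers plain cross-entropy, which makes the down-weighting effect easy to verify directly.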
Zipf's law: Empirical law stating that the frequency of any word is inversely proportional to its rank in the frequency table
PPL: Perplexity—a measure of how well a probability model predicts a sample (lower is better); it is the exponential of the average per-token negative log-likelihood
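Concretely, perplexity is exp of the mean negative log-likelihood over the tokens in a sample. A minimal sketch, taking the model's probability for each observed token as input (function name is illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    token_probs: the probability the model assigned to each
    token that actually occurred in the sample.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# If the model assigns uniform probability 1/4 to every observed
# token, perplexity is 4: the model is as uncertain as a fair
# 4-way choice at each step.
ppl = perplexity([0.25, 0.25, 0.25])
```

A perfectly confident model (probability 1 on every observed token) achieves the minimum perplexity of 1.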
multi-label classification: Classification tasks where multiple classes can be correct simultaneously (e.g., multiple valid next words)
information entropy: A measure of the uncertainty or randomness in a probability distribution; high entropy means the distribution is flat/uncertain, low entropy means it is peaked/confident
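Shannon entropy is H(p) = -Σᵢ pᵢ log pᵢ, maximized by the uniform distribution and zero for a one-hot distribution. A minimal sketch in nats (function name is illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: H(p) = -sum(p_i * log p_i).

    Zero-probability entries contribute nothing (lim p->0 of
    p*log p is 0), so they are skipped.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# A flat distribution is maximally uncertain; a peaked one is not.
flat = entropy([0.25, 0.25, 0.25, 0.25])        # log(4), the maximum for 4 outcomes
peaked = entropy([0.97, 0.01, 0.01, 0.01])      # much lower
```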
LLaMA: Large Language Model Meta AI—a state-of-the-art open foundation model architecture used here as the backbone
The Pile: A large-scale, diverse open-source text dataset used for training language models