MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models

📝 Paper Summary

Language Model Pre-training, Loss Functions, Class Imbalance in NLP

MiLe Loss dynamically scales training gradients based on the entropy of the predicted probability distribution to focus learning on infrequent, difficult-to-learn tokens rather than easy, frequent ones.

Core Problem

Language model training is dominated by frequent, easy-to-learn tokens because token frequencies in the training data follow a Zipfian distribution, so models neglect infrequent, informative tokens.

Why it matters:

Standard Cross-Entropy Loss treats all tokens equally, allowing the vast number of easy tokens to overwhelm the training signal
Existing solutions like Focal Loss fail in language modeling because next-token prediction is often a multi-label problem (multiple valid next tokens), not a single-label problem
Models trained on imbalanced data exhibit high perplexity on rare tokens, indicating poor understanding of the 'long tail' of language

Concrete Example: Given 'I like playing ___', valid completions include 'basketball', 'football', and 'golf'. If 'basketball' is the target, it has low probability (e.g., 0.18) because probability mass is split among valid options. Focal Loss misinterprets this low probability as 'difficult' and upweights it excessively, even though the model isn't actually confused—it just sees multiple valid options.
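
The contrast in this example can be made concrete with a small sketch. Only the 0.18 target probability comes from the example; the two distributions, the 1,000-token toy vocabulary, the natural logarithm, and gamma = 1 are illustrative assumptions. The MiLe scaling factor is taken as (1 + entropy)^gamma, which follows from the loss formula given under Objective Functions.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def focal_weight(p_target, gamma=1.0):
    """Focal Loss scaling factor: depends only on the target probability."""
    return (1.0 - p_target) ** gamma

def mile_weight(probs, gamma=1.0):
    """MiLe Loss scaling factor: depends on the entropy of the whole distribution."""
    return (1.0 + entropy(probs)) ** gamma

# Hypothetical toy vocabulary; index 0 is the target token ("basketball").
vocab_size = 1000

# Ambiguous but not confused: 90% of the mass sits on five valid sports.
ambiguous = [0.18] * 5 + [0.10 / (vocab_size - 5)] * (vocab_size - 5)

# Genuinely confused: the remaining 82% is spread thinly over all other tokens.
confused = [0.18] + [0.82 / (vocab_size - 1)] * (vocab_size - 1)

for name, dist in (("ambiguous", ambiguous), ("confused", confused)):
    print(name,
          "focal weight:", round(focal_weight(dist[0]), 3),
          "MiLe weight:", round(mile_weight(dist), 3))
```

Under these assumptions the Focal Loss factor is identical for both cases, because it only sees the target probability, while the entropy-based factor is larger for the confused distribution.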

Key Novelty

Entropy-based Dynamic Loss Scaling

Use the information entropy of the model's predicted probability distribution to assess difficulty, rather than just the target token's probability
High entropy implies the model is unsure (distribution is flat), indicating a truly difficult-to-learn token; low entropy implies the model is confident (distribution is peaked), indicating an easy token
Scale the loss weight so that it grows with this entropy, forcing the model to pay more attention to uncertain, difficult contexts

Architecture

Conceptual illustration of the 'Next Token Prediction as Multi-Label Classification' problem.

Evaluation Highlights

Consistent gains across 8 common sense reasoning benchmarks: the 6.7B model with MiLe Loss improves average accuracy by 1.02 points over the Cross-Entropy baseline in the 5-shot setting
+4.17% improvement in Zero-shot Exact Match on TriviaQA (6.7B model) compared to Focal Loss
Reduces perplexity specifically for 'medium' and 'difficult' frequency tokens (e.g., from 15.517 to 15.371 for difficult tokens) while maintaining performance on easy tokens

Breakthrough Assessment

6/10

A solid, mathematically grounded improvement over Cross-Entropy for LM pre-training. While the gains are consistent, they are relatively modest (1-2%). The entropy-based insight for multi-label ambiguity is clever.

⚙️ Technical Details

Problem Definition

Setting: Generative Language Model Pre-training (Next Token Prediction)

Inputs: Sequence of previous tokens t = [t_1, ..., t_{i-1}]

Outputs: Probability distribution p over vocabulary V for next token t_i

Pipeline Flow

Tokenize Input Text
Transformer Backbone (Compute Hidden States)
Linear Projection + Softmax (Compute Probabilities p)
Calculate Entropy of p
Compute Scaling Factor ((1 + Entropy)^γ)
Compute MiLe Loss
Backpropagate
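
The loss-related steps of this pipeline (softmax, entropy, scaling factor, loss) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `mile_loss` are hypothetical, the natural logarithm is assumed, and the loss is averaged over positions. Backpropagation is left to an autograd framework; whether the scaling factor is treated as a constant during the backward pass is not stated in this summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mile_loss(logits_per_position, target_ids, gamma=1.0):
    """Mean of -(1 + H(p))^gamma * log p[target] over all positions."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        # Entropy of the predicted distribution (natural log assumed).
        ent = -sum(p * math.log(p) for p in probs if p > 0)
        # Scaling factor: 1 - sum(p log p) = 1 + entropy.
        scale = (1.0 + ent) ** gamma
        total += -scale * math.log(probs[target])
    return total / len(target_ids)

# Toy usage: two positions over a 4-token vocabulary.
logits = [[2.0, 0.5, 0.1, -1.0],   # fairly peaked prediction
          [0.0, 0.0, 0.0, 0.0]]    # completely flat prediction
targets = [0, 2]
print("gamma=0 (plain cross-entropy):", mile_loss(logits, targets, gamma=0.0))
print("gamma=1 (MiLe default):", mile_loss(logits, targets, gamma=1.0))
```

Setting gamma to 0 makes the scaling factor 1, so the sketch reduces to standard Cross-Entropy, which is a convenient sanity check.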

System Modules

Tokenizer

Convert text to token IDs

Model or implementation: LLaMA tokenizer (32k vocabulary)

Transformer Backbone

Compute contextual representations

Model or implementation: LLaMA-based architectures (468M, 1.2B, 6.7B parameters)

Entropy Scaler

Compute dynamic weight for current token based on prediction uncertainty

Model or implementation: Mathematical function (Entropy calculation)

Novel Architectural Elements

None (Architecture is standard Transformer; innovation is in the loss function)

Modeling

Base Model: Custom models following LLaMA architecture (468M, 1.2B, 6.7B parameters)

Training Method: Pre-training from scratch

Objective Functions:

MiLe Loss

Purpose: Scale loss based on prediction entropy to focus on hard tokens.

Formally: L_MiLe = -(1 - Σ_j p_j log(p_j))^γ * log(p_{t_i})

Cross-Entropy Loss

Purpose: Standard baseline objective.

Formally: L_CE = -log(p_{t_i})

Focal Loss

Purpose: Baseline for addressing imbalance.

Formally: L_FL = -(1 - p_{t_i})^γ * log(p_{t_i})
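
Rewriting the MiLe objective in terms of entropy makes the range of its scaling factor explicit. The derivation below is a sketch that assumes the natural logarithm, γ ≥ 0, and a sum over the full vocabulary V; the symbol H(p) is introduced here for illustration.

```latex
\begin{aligned}
H(p) &= -\sum_{j \in V} p_j \log p_j \\
\mathcal{L}_{\mathrm{MiLe}}
  &= -\Bigl(1 - \sum_{j \in V} p_j \log p_j\Bigr)^{\gamma} \log p_{t_i}
   = -\bigl(1 + H(p)\bigr)^{\gamma} \log p_{t_i} \\
0 \le H(p) \le \log |V|
  &\;\Longrightarrow\;
  1 \le \bigl(1 + H(p)\bigr)^{\gamma} \le \bigl(1 + \log |V|\bigr)^{\gamma}
\end{aligned}
```

With the 32k-token vocabulary listed above, log |V| is about 10.37, so with γ = 1 the factor lies between 1 and roughly 11.37. The MiLe factor therefore never falls below the Cross-Entropy weight of 1, whereas the Focal Loss factor (1 - p_{t_i})^γ lies in [0, 1] and can only down-weight. Setting γ = 0 recovers L_CE in both cases.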

Training Data:

The Pile dataset (825GB text)
Sampled using domain weights (e.g., PubMed Central 0.2823, ArXiv 0.1997)

Key Hyperparameters:

learning_rate: 3.0e-4
batch_size: 2048 (for 6.7B model)
seq_length: 2048 (for 6.7B model)
optimizer: AdamW
warmup_steps: 2000
gamma: 1.0 (default for MiLe Loss)
training_tokens: 100B (default) up to 200B
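
As a rough sanity check on these settings, the number of optimizer steps implied by the token budget can be worked out directly. The sketch assumes that batch_size counts sequences per step, that every sequence is packed to the full seq_length, and that 100B means 1e11 tokens; none of this is confirmed by the summary, and `steps_for` is a hypothetical helper.

```python
import math

batch_size = 2048   # sequences per step (6.7B setting, assumed unit)
seq_length = 2048   # tokens per sequence
warmup_steps = 2000

tokens_per_step = batch_size * seq_length  # about 4.19M tokens

def steps_for(token_budget):
    """Optimizer steps needed to consume a given token budget."""
    return math.ceil(token_budget / tokens_per_step)

for budget in (100e9, 200e9):
    steps = steps_for(budget)
    print(f"{budget:.0e} tokens -> {steps} steps, "
          f"warmup is {warmup_steps / steps:.1%} of training")
```

Under these assumptions the 100B-token run is roughly 23,800 steps, so the 2,000 warmup steps cover a little over 8% of training.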

Compute: Not reported in the paper

Comparison to Prior Work

vs. Focal Loss: Uses global entropy of distribution rather than just target token probability. Handles multi-modal distributions (synonyms) better.
vs. Cross-Entropy: dynamically upweights hard/uncertain tokens.
vs. CB Loss (Class-Balanced Loss) [not cited in paper]: CB Loss reweights based on static class frequency counts; MiLe Loss reweights based on dynamic model uncertainty per instance.

Limitations

Computational overhead of calculating entropy over full vocabulary at every step
Sensitivity to noisy data: outliers with high entropy might be excessively upweighted
Only evaluated on English text (The Pile)
Performance gain is consistent but small (typically <2%)

Reproducibility

Code: https://github.com/EleutherAI/lm-evaluation-harness (evaluation harness only; no code for the loss itself is linked)

Uses standard open datasets (The Pile) and evaluation harness (lm-evaluation-harness). Model architectures mimic Pythia/LLaMA. Loss function formula is explicit. Code for the loss itself is not linked, but implementation is straightforward from Equation 6.

📊 Experiments & Results

Evaluation Setup

Pre-training from scratch on 100B/200B tokens, followed by zero-shot and few-shot evaluation on downstream tasks.

Benchmarks:

Common Sense Reasoning Suite (8 datasets including BoolQ, HellaSwag, PIQA, WinoGrande)
TriviaQA / WebQuestions (Closed-book Question Answering)
MMLU (Massive Multitask Language Understanding)

Metrics:

Accuracy
Exact Match (EM)
Perplexity (PPL)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Common Sense Reasoning results (Average of 8 datasets) showing scaling benefits.
Common Sense Avg (5-shot)	Accuracy	58.66	59.68	+1.02
Common Sense Avg (0-shot)	Accuracy	49.14	49.93	+0.79
Closed-Book QA results highlighting significant gains in factual retrieval.
TriviaQA (0-shot)	Exact Match	17.09	20.64	+3.55
Token-level analysis showing where the perplexity improvements come from.
Pile Validation Set	Perplexity (Difficult Tokens)	15.517	15.371	-0.146
Extended training results showing gains scale with more data.
Common Sense Avg (5-shot, extended training)	Accuracy	61.75	63.08	+1.33

Experiment Figures

Grid search results for the gamma hyperparameter across model sizes.

Perplexity breakdown by token difficulty bucket (Easy, Medium, Difficult).
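
A per-bucket perplexity breakdown of this kind can be computed by grouping per-token negative log-likelihoods by token frequency. The sketch below is illustrative only: the frequency thresholds, the function names, and the toy inputs are hypothetical, and the paper's actual bucket definitions are not given in this summary.

```python
import math
from collections import defaultdict

def bucket_of(train_count, easy_min=1000, medium_min=100):
    """Assign a token to a bucket by training-set frequency (hypothetical thresholds)."""
    if train_count >= easy_min:
        return "easy"
    if train_count >= medium_min:
        return "medium"
    return "difficult"

def perplexity_by_bucket(token_ids, nlls, train_counts):
    """Perplexity per bucket: exp of the mean negative log-likelihood in that bucket."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for tok, nll in zip(token_ids, nlls):
        bucket = bucket_of(train_counts.get(tok, 0))
        sums[bucket] += nll
        sizes[bucket] += 1
    return {b: math.exp(sums[b] / sizes[b]) for b in sums}

# Toy validation stream: token ids, their NLLs, and training-set counts.
token_ids = [0, 0, 1, 2]
nlls = [math.log(2), math.log(2), math.log(4), math.log(8)]
train_counts = {0: 5000, 1: 500, 2: 5}
print(perplexity_by_bucket(token_ids, nlls, train_counts))
```

Comparing these per-bucket values between two models trained with different losses shows which frequency range accounts for an overall perplexity change.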

Main Takeaways

MiLe Loss consistently outperforms Cross-Entropy and Focal Loss across diverse benchmarks (Reasoning, QA, MMLU), verifying its generalizability.
The method is particularly effective for 'medium' and 'difficult' frequency tokens, reducing their perplexity without significantly hurting 'easy' tokens.
Benefits of MiLe Loss increase with more training data (the 5-shot gap widens from 1.02 to 1.33 points when doubling data), suggesting it helps models learn more efficiently from the long tail.
Focal Loss often underperforms Cross-Entropy on zero-shot tasks (e.g., TriviaQA), likely because it over-weights targets whose low probability reflects several valid continuations rather than genuine difficulty; MiLe Loss avoids this pitfall.

📚 Prerequisite Knowledge

Prerequisites

Cross-Entropy Loss
Focal Loss
Information Entropy
Zipf's Law

Key Terms

MiLe Loss: Proposed loss function, named for Mitigating the bias of Learning difficulties, that scales each token's loss by the entropy of the model's predicted distribution

Focal Loss: A loss function designed for class imbalance that down-weights easy examples to focus on hard ones

Zipf's law: Empirical law stating that the frequency of any word is inversely proportional to its rank in the frequency table

PPL: Perplexity, the exponential of the average per-token negative log-likelihood; lower values mean the model predicts the text better

multi-label classification: Classification tasks where multiple classes can be correct simultaneously (e.g., multiple valid next words)

information entropy: A measure of the uncertainty or randomness in a probability distribution; high entropy means the distribution is flat/uncertain

LLaMA: Large Language Model Meta AI—a state-of-the-art open foundation model architecture used here as the backbone

The Pile: A large-scale, diverse open-source text dataset used for training language models