SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific task using labeled examples
CE: Cross-Entropy—the standard loss function that maximizes the likelihood of the correct label
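A minimal sketch of the cross-entropy loss for a single example, assuming the model already outputs a normalized probability distribution over classes (real training code would operate on logits and batches):

```python
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the correct label under the model's
    # predicted distribution; minimizing this maximizes p(target).
    return -math.log(probs[target_idx])

# Model assigns probability 0.7 to the correct class (index 1):
loss = cross_entropy([0.1, 0.7, 0.2], 1)  # -log(0.7) ≈ 0.357
```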
GEM: Game-theoretic Entropy Maximization—the proposed training algorithm that preserves diversity
reverse KL: Reverse Kullback-Leibler divergence (KL(model || data))—a distribution distance metric that tends to be mode-seeking (concentrating the model on a subset of the data's modes) rather than mean-seeking like forward KL, and is often harder to optimize
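The mode-seeking vs. mean-seeking distinction can be seen numerically on a toy discrete example: for a model collapsed onto one of two data modes, forward KL (data || model) is large because the missed mode is heavily penalized, while reverse KL (model || data) stays small. This is an illustrative sketch, not the paper's training objective:

```python
import math

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

data  = [0.5, 0.5]    # bimodal "data" distribution
model = [0.99, 0.01]  # model collapsed onto the first mode

forward = kl(data, model)  # KL(data || model): punishes the missed second mode
reverse = kl(model, data)  # KL(model || data): mode collapse is cheap here
```

Here `forward` ≈ 1.61 while `reverse` ≈ 0.64, showing why minimizing reverse KL tolerates a model that covers only some modes.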
test-time scaling: Improving performance during inference by generating multiple samples and selecting the best one (often via a reward model or verifier)
alignment tax: The degradation of a model's general capabilities or pre-trained knowledge resulting from fine-tuning on a specific task
logit: The raw, unnormalized output scores of the neural network before applying the softmax function
Best-of-N: A sampling strategy where N different responses are generated, and the best one is selected based on a scoring mechanism
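A minimal Best-of-N sketch, where `generate` and `score` are hypothetical stand-ins for a sampler and a reward model or verifier:

```python
import itertools

def best_of_n(generate, score, n):
    # Draw n candidate responses and keep the highest-scoring one.
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: "generate" cycles through canned responses, "score" is length.
pool = itertools.cycle(["ok", "better answer", "meh"])
best = best_of_n(lambda: next(pool), len, 3)  # -> "better answer"
```

Note that Best-of-N only helps when the sampler produces diverse candidates, which is exactly the property diversity-preserving fine-tuning aims to retain for test-time scaling.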
sparse update: Updating only a subset of parameters or token probabilities (specifically pivot tokens) rather than the entire vocabulary distribution
adaptive termination: Stopping the optimization for a specific sample once a condition is met (e.g., target token has highest probability) to prevent overfitting