DPO: Direct Preference Optimization—an algorithm that optimizes a language model to prefer 'chosen' over 'rejected' responses without fitting an explicit reward model
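The DPO objective can be sketched per example from sequence log-probabilities (a minimal sketch of the standard sigmoid-margin form with inverse temperature beta; `dpo_loss` is an illustrative name, not an API from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The policy is pushed to widen the (chosen - rejected) log-prob
    margin relative to a frozen reference model; no reward model
    is fit explicitly.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written to avoid overflow for large negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

A positive margin (the policy prefers the chosen response more strongly than the reference does) drives the loss toward zero.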
SFT: Supervised Finetuning—training the model to maximize the likelihood of ground-truth responses
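Concretely, maximizing the likelihood of a ground-truth response amounts to minimizing token-level cross-entropy (a minimal sketch; `softmax_nll` is a hypothetical helper, not from the paper):

```python
import math

def softmax_nll(logits, target):
    """Cross-entropy of one ground-truth token given raw logits.

    SFT minimizes the mean of this quantity over the response tokens,
    which is equivalent to maximizing their likelihood.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]  # -log softmax(logits)[target]
```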
eNTK: Empirical Neural Tangent Kernel—a matrix whose entries measure, via gradient similarity, how much updating the model on one example changes the prediction on another example
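Given per-example gradient vectors, the eNTK is simply their Gram matrix (a toy sketch with hypothetical hand-written gradients, not gradients of any real model):

```python
import numpy as np

# Hypothetical per-example gradients of the model output w.r.t. the
# parameters: one row per training example (3 examples, 4 parameters).
grads = np.array([[1.0, 0.0, 2.0, 0.0],
                  [1.0, 0.0, 2.0, 0.0],    # identical to example 0
                  [0.0, 1.0, 0.0, -1.0]])  # orthogonal to example 0

# eNTK entry K[i, j] = grad_i . grad_j: how strongly a gradient step
# on example j moves the prediction on example i.
K = grads @ grads.T
```

Identical gradients give a large off-diagonal entry (updating one example moves the other just as much); orthogonal gradients give zero (no interaction).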
squeezing effect: A phenomenon where negative gradients on unlikely classes in softmax models force probability mass into the single highest-probability class, often leading to repetitive or degenerate output
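The squeezing effect can be reproduced in a three-class toy model: one gradient-ascent step on the negative log-probability of an already-unlikely class (i.e., a negative gradient on it) drains mass from every non-dominant class into the top class (a sketch; the logits and learning rate are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits: class 0 already dominates, class 2 is unlikely.
z = np.array([4.0, 2.0, 0.0])
p = softmax(z)

# Gradient of -log p_2 w.r.t. the logits is (p - onehot(2)); a negative
# gradient on class 2 means one *ascent* step on that loss.
onehot = np.array([0.0, 0.0, 1.0])
lr = 1.0
z_new = z + lr * (p - onehot)  # ascent: push p_2 down
p_new = softmax(z_new)
# Mass leaves BOTH non-dominant classes (1 and 2) and is squeezed into
# class 0, the single highest-probability class.
```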
learning dynamics: The study of how model predictions change step-by-step during training as a function of optimization updates
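A one-step view ties several of these terms together. Under gradient descent with learning rate $\eta$, the change in the model's prediction on an observed example $x_o$ after an update on example $x_u$ can be sketched as a product of three factors (an assumed decomposition, stated here only to connect the eNTK and residual-term entries in this glossary; $\mathcal{A}_t$ denotes the local map from logits to the monitored prediction):

```latex
\Delta f_t(x_o) \;\approx\; -\eta \,\mathcal{A}_t(x_o)\,\mathcal{K}_t(x_o, x_u)\,\mathcal{G}_t(x_u)
```

where $\mathcal{K}_t$ is the eNTK and $\mathcal{G}_t$ is the residual term.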
off-policy: Training on data generated by a different policy (e.g., a static dataset) rather than the model currently being trained
on-policy: Training on data generated by the current version of the model itself
hallucination: Model generation of incorrect or non-factual content, analyzed here as the model answering question A with facts that belong to question B ('facts from question B answering question A')
teacher forcing: A training technique where the model is fed the ground-truth previous token as input for the next step, rather than its own generation
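The training loop shape is easy to show (a minimal sketch; `model` is a hypothetical stand-in mapping a token prefix to a next-token prediction):

```python
def teacher_forced_predictions(model, target_tokens):
    """One pass with teacher forcing: at every step the model sees the
    ground-truth prefix, never its own earlier predictions."""
    preds = []
    for t in range(1, len(target_tokens)):
        prefix = target_tokens[:t]   # ground-truth tokens 0..t-1
        preds.append(model(prefix))  # predict token t from them
    return preds

# Dummy stand-in "model" (echoes last token + 1) just to show the loop.
demo = teacher_forced_predictions(lambda prefix: prefix[-1] + 1, [1, 2, 3, 4])
# demo == [2, 3, 4]
```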
residual term: The vector difference between the current prediction and the target, determining the direction of the gradient update (denoted as G_t in the paper)
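For token-level cross-entropy this residual has a closed form: the softmax prediction minus the one-hot target, which is exactly the gradient of the loss with respect to the logits (a sketch; the logits are illustrative and `G` stands in for the paper's G_t):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative logits over 3 tokens; the ground-truth token is index 1.
z = np.array([2.0, 1.0, 0.0])
target = np.array([0.0, 1.0, 0.0])

# Residual: current prediction minus target. Its entries sum to zero,
# and its sign pattern sets the direction of the update (push the
# target token up, all others down).
G = softmax(z) - target
```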
logits: The raw, unnormalized scores output by the neural network before the softmax layer