The Viscosity of Logic: Phase Transitions and Hysteresis in DPO Alignment

📝 Paper Summary

Alignment Stability, Direct Preference Optimization (DPO), Model Capability Collapse

Treating DPO alignment pressure as a control parameter reveals that reasoning capabilities can collapse suddenly in a phase transition and suffer permanent damage through hysteresis.

Core Problem

Standard DPO tuning assumes that increasing alignment pressure smoothly improves behavior, but the DPO preference margin used as the proxy for that improvement often anticorrelates with actual reasoning capability.

Why it matters:

Models selected for best alignment scores may have degraded logic and reasoning (Goodhart's Law)
The assumption that capability loss can be reversed by annealing (lowering beta after high-pressure training) does not hold; high-pressure training causes permanent damage
Coarse hyperparameter sweeps miss narrow 'pockets' where models remain capable

Concrete Example: In Mistral-7B, a model trained at beta=0.01 maintains reasoning, but slightly increasing pressure to beta=0.015 causes a total collapse of logic capabilities that cannot be fixed even if pressure is later reduced.

Key Novelty

Thermodynamic Analysis of Alignment (Viscosity/Hysteresis)

Maps the 'capability landscape' of DPO by densely sweeping the beta parameter, treating it like temperature in physics to find critical points
Identifies 'hysteresis' in neural weights: models exposed to high alignment pressure retain damage even after pressure is lowered, unlike elastic materials

Architecture

Phase diagrams plotting Alignment Pressure (Beta) on the x-axis vs. Probe Margin (Capability) on the y-axis for Mistral-7B.

Evaluation Highlights

Strong anticorrelation (Pearson r = -0.91) between DPO preference margin and logic capability in LLaMA-2-7B, indicating that alignment scores can mislead
Mistral-7B logic capability is confined to a narrow 'pocket' (beta approx 0.006–0.009), outside of which it collapses
Training path dependence: 'Annealing' (high then low beta) yields significantly worse logic than constant low beta (p=0.032), confirming hysteresis
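
The anticorrelation highlight rests on a Pearson coefficient computed across checkpoints, pairing each checkpoint's DPO preference margin with its logic probe margin. A minimal sketch of that computation is below; the function name `pearson_r` and the per-checkpoint numbers are hypothetical placeholders, not values from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-checkpoint values (NOT from the paper), one entry per beta setting.
dpo_margin = [0.1, 0.4, 0.9, 1.5, 2.2]
logic_margin = [0.8, 0.5, 0.1, -0.6, -1.0]

r = pearson_r(dpo_margin, logic_margin)
print(f"r = {r:.2f}")
```

A strongly negative r means that checkpoints ranked best by the optimization objective tend to rank worst on the capability probe, which is the Goodhart-style failure the summary describes.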

Breakthrough Assessment

8/10

Provides critical empirical evidence that alignment is not a smooth optimization path but a landscape with dangerous cliffs (phase transitions) and irreversible traps (hysteresis), challenging standard tuning practices.

⚙️ Technical Details

Problem Definition

Setting: Direct Preference Optimization (DPO) hyperparameter tuning and capability preservation

Inputs: Base Language Model, Preference Pairs (chosen/rejected)

Outputs: Aligned Policy Model

Pipeline Flow

Input Prompt
LLM (with LoRA adapters)
Log-Probability Calculation (Probe Scoring)

System Modules

Base LLM (Inference)

Foundational language model being aligned

Model or implementation: Mistral-7B, LLaMA-2-7B, or Qwen-1.5-7B

LoRA Adapters (Inference)

Trainable parameters updated during DPO to align behavior

Model or implementation: Rank 8, Alpha 16, Dropout 0.05

Modeling

Base Model: Mistral-7B-v0.1, Llama-2-7b-hf, Qwen1.5-7B

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to favor chosen responses over rejected ones while penalizing divergence from reference.

Formally: DPO loss with beta parameter scaling the KL-divergence constraint.
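
For reference, the standard DPO loss for a single preference pair can be sketched as follows. This is the textbook form of the objective rather than code from the paper; the function name `dpo_pair_loss` and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))])

    All inputs are summed sequence log-probabilities.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Illustrative values only: the policy has shifted probability toward the chosen answer.
loss = dpo_pair_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1)
print(f"loss = {loss:.4f}")
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2 regardless of beta. Beta only matters once the policy starts to move, because it multiplies the log-ratio difference inside the sigmoid.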

Adaptation: LoRA (rank=8, alpha=16, dropout=0.05)
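
To give a sense of scale for this adapter configuration, the sketch below counts the trainable parameters of a rank-8 LoRA update on one projection matrix and computes the alpha/rank scaling factor. The 4096 x 4096 shape is an assumption for illustration; the summary does not say which modules were adapted.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in one LoRA update: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

rank, alpha = 8, 16
d_out = d_in = 4096  # assumed projection shape, for illustration only

full = d_out * d_in
lora = lora_param_count(d_out, d_in, rank)
scaling = alpha / rank  # LoRA multiplies the low-rank update B @ A by alpha / rank

print(f"full matrix params: {full}")
print(f"LoRA params:        {lora} ({100 * lora / full:.2f}% of full)")
print(f"update scaling:     {scaling}")
```

Under these assumptions the adapter trains well under 1% of the parameters of the matrix it modifies, while the frozen base weights stay untouched.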

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 4
train_steps: 200
beta_sweep_range: Logarithmic sweep from approx 0.0005 to 0.1
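
A logarithmic sweep over this range can be generated as in the sketch below. The endpoints come from the summary; the grid density (`num_points=24`) and the helper name `log_sweep` are assumptions, since the number of sweep points is not stated here.

```python
import math

def log_sweep(beta_min, beta_max, num_points):
    """Return num_points beta values spaced evenly in log10 between the endpoints."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    lo, hi = math.log10(beta_min), math.log10(beta_max)
    step = (hi - lo) / (num_points - 1)
    return [10 ** (lo + i * step) for i in range(num_points)]

# num_points=24 is an assumed grid density, not a value from the paper.
betas = log_sweep(0.0005, 0.1, num_points=24)
for b in betas:
    print(f"{b:.5f}")
```

Grid density matters because the reported logic-positive pocket (beta approx 0.006 to 0.009) covers only about 0.18 of a decade, while the whole sweep spans about 2.3 decades. A coarse grid can step over the pocket entirely.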

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard DPO: Proposes beta as a critical control parameter for 'capability landscapes' rather than a simple hyperparameter to maximize
vs. PPO [not cited in paper]: Focuses on offline DPO stability; PPO is known for instability but this paper shows DPO has its own 'phase transition' failure modes

Limitations

Study limited to 7B parameter models; scaling laws for these phase transitions are unknown
Evaluation relies on fixed probe margins rather than free-form generation benchmarks, apart from a brief mention of GSM8K
Focuses on a specific set of capabilities (logic, arithmetic, format) which may not represent all reasoning types

Reproducibility

The paper mentions a reproducibility archive for raw outputs but provides no URL in the text. Training uses standard open-weight models (Mistral, LLaMA-2, Qwen) and standard DPO hyperparameters.

📊 Experiments & Results

Evaluation Setup

Controlled DPO sweep starting from identical checkpoints

Benchmarks:

Logic Probes (Syllogistic reasoning and ordering)
Arithmetic Probes (Multi-digit operations)
Format Probes (JSON generation and boolean constraints)

Metrics:

Probe Margin (length-normalized log-prob difference)
DPO Preference Margin (optimization objective)
Roughness (training stability metric)
Statistical methodology: Paired t-tests for path dependence; Pearson correlation for margin-capability relationship
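
The probe margin metric can be sketched as below. It assumes that "length-normalized" means dividing each answer's summed token log-probability by its token count; the summary does not spell out the exact normalization, and the token log-probabilities shown are hypothetical.

```python
def length_normalized_logprob(token_logprobs):
    """Mean per-token log-probability of one candidate answer."""
    if not token_logprobs:
        raise ValueError("answer must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)

def probe_margin(correct_token_logprobs, incorrect_token_logprobs):
    """Positive when the model prefers the correct answer on a per-token basis."""
    return (length_normalized_logprob(correct_token_logprobs)
            - length_normalized_logprob(incorrect_token_logprobs))

# Hypothetical token log-probs for one probe item (not from the paper).
correct = [-0.5, -1.0, -1.5]   # three-token correct answer
incorrect = [-2.0, -4.0]       # two-token incorrect answer

margin = probe_margin(correct, incorrect)
print(f"probe margin = {margin:.2f}")
```

Normalizing by length keeps a longer answer from being penalized simply for having more tokens, so the sign of the margin reflects preference rather than answer length.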

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Logic Probes (LLaMA-2-7B)	Pearson Correlation (r)	0	-0.91	-0.91
Mistral-7B Training	Roughness Drop	1.00	0.63	-0.37
Format Probes (Mistral-7B)	Probe Margin	0.0	4.44	+4.44
Format Probes (LLaMA-2-7B)	Probe Margin	0	-2.0	-2.0

Experiment Figures

Hysteresis loops comparing 'Quench' (Path A) vs 'Anneal' (Path B) trajectories.

Main Takeaways

Mistral-7B exhibits a 'logic-positive pocket' only within a narrow beta band (approx 0.006-0.009); outside this, logic fails
Capabilities form a hierarchy: surface behaviors (format/sycophancy) align at much lower pressures (beta ~ 0.0005) than deep reasoning (beta ~ 0.008)
Hysteresis is real: 'Annealing' (training at beta=0.02 then 0.01) results in significantly worse logic performance than training directly at beta=0.01, indicating that the damage is not undone by lowering beta under this protocol
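
The path-dependence comparison can be pictured as two beta schedules that end at the same value. The sketch below is illustrative only: the half-way switch point, the step budget, and the names `beta_at_step`, `constant`, and `anneal` are assumptions, since the summary gives the two beta values but not the schedule details.

```python
def beta_at_step(step, path, total_steps=200, high_beta=0.02, low_beta=0.01):
    """Beta used at a given training step for two hypothetical training paths.

    'constant': low_beta for the whole run.
    'anneal':   high_beta for the first half, then low_beta.
    The half-way switch and total_steps are assumptions for illustration.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    if path == "constant":
        return low_beta
    if path == "anneal":
        return high_beta if step < total_steps // 2 else low_beta
    raise ValueError(f"unknown path: {path}")

for name in ("constant", "anneal"):
    schedule = [beta_at_step(s, name) for s in range(200)]
    print(name, "start:", schedule[0], "end:", schedule[-1])
```

Both paths finish at the same beta, so any gap in final logic probe margin reflects training history rather than the final setting. That gap is what the hysteresis claim measures.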

📚 Prerequisite Knowledge

Prerequisites

Understanding of DPO (Direct Preference Optimization) and the KL-divergence constraint
Basic physics concepts (Phase Transition, Hysteresis) useful for the analogy
Familiarity with LLM evaluation via log-probability probes

Key Terms

DPO: Direct Preference Optimization—an alignment method optimizing a policy to satisfy preferences while staying close to a reference model via a KL penalty

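Beta (β): The DPO hyperparameter that scales the implicit KL penalty toward the reference model. In the standard DPO formulation, a larger beta keeps the policy closer to the reference model; this paper sweeps beta as its 'alignment pressure' control parameter and describes higher values as higher pressure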

Hysteresis: A phenomenon where the state of a system depends on its history; here, a model remains damaged even after the cause of damage (high beta) is removed

Phase Transition: A sharp, discontinuous change in system behavior (e.g., capability collapse) occurring over a narrow range of a control parameter

Probe Margin: The difference in length-normalized log-probabilities between correct and incorrect answers on a specific capability task

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the original model weights and trains small low-rank matrices whose product is added to selected weight matrices