SFT: Supervised Fine-tuning—training models to predict the next token given the ground-truth history (teacher forcing)
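A minimal sketch of the SFT objective, assuming per-step model probabilities for the reference tokens are given; `sft_loss` and the numbers below are illustrative, not the paper's implementation:

```python
import math

# SFT with teacher forcing: the model is always conditioned on the
# ground-truth prefix, never on its own samples. `step_probs` holds the
# hypothetical model probabilities p(y_t | y_<t, x) for each reference token.
def sft_loss(step_probs):
    # Mean negative log-likelihood of the reference sequence.
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# A 3-token reference sentence whose tokens the model assigns
# probabilities 0.9, 0.5, 0.8 given the ground-truth history.
loss = sft_loss([0.9, 0.5, 0.8])
```

Because the loss conditions on the ground-truth history at every step, training never exposes the model to its own mistakes—exactly the setup that gives rise to exposure bias.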
Exposure Bias: The discrepancy where models are trained on ground truth history but must generate based on their own potentially erroneous predictions during inference
On-policy: Learning algorithms that optimize the model based on data generated by the current version of the model itself
Off-policy: Learning algorithms that optimize the model using a static dataset collected from a different policy (e.g., a previous version or humans)
NES: Natural Evolution Strategy—a class of optimization algorithms that updates parameters by estimating gradients from random perturbations (mutations) of the parameters and their fitness scores
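A hedged sketch of one NES update on a toy problem; `nes_step`, the fitness function, and all hyperparameters are illustrative assumptions, not the paper's setup:

```python
import random

# NES step: sample Gaussian perturbations eps ~ N(0, I), score the perturbed
# parameters with a black-box fitness function, and move theta along the
# fitness-weighted average of the perturbations (a REINFORCE-style estimate
# of the gradient of the Gaussian-smoothed fitness).
def nes_step(theta, fitness, sigma=0.1, pop=50, lr=0.05, rng=random.Random(0)):
    grad = [0.0] * len(theta)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        # Accumulate f * eps / sigma, averaged over the population.
        for i, e in enumerate(eps):
            grad[i] += f * e / (sigma * pop)
    return [t + lr * g for t, g in zip(theta, grad)]

# Maximize the toy fitness -(x - 3)^2: theta drifts toward 3 using only
# fitness evaluations, never an analytic gradient.
theta = [0.0]
for _ in range(200):
    theta = nes_step(theta, lambda v: -(v[0] - 3.0) ** 2)
```

Note that only fitness *values* are needed, which is why NES applies even when the objective (e.g. a non-differentiable reward) provides no backpropagatable gradient.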
Perturbation signal: In this paper, the gradient of the log-probability of a generated sentence, used as a 'mutation' direction in parameter space
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to be small to ensure stability
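A sketch of PPO's clipped surrogate objective for a single action, assuming `ratio` is pi_new(a|s) / pi_old(a|s) and `adv` is an advantage estimate; the function name and example values are illustrative:

```python
# PPO clipped surrogate: take the pessimistic minimum of the unclipped and
# clipped terms, which removes any incentive to push the probability ratio
# outside [1 - eps, 1 + eps] — this is how PPO keeps policy updates small.
def ppo_clip_objective(ratio, adv, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

# A large ratio with positive advantage earns no credit beyond 1 + eps:
ppo_clip_objective(1.5, 2.0)  # capped at 1.2 * 2.0
```

In practice this per-action objective is averaged over a batch of on-policy rollouts and maximized by gradient ascent.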
DPO: Direct Preference Optimization—an off-policy method that optimizes the policy on preference data directly, without training an explicit reward model
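A hedged sketch of the DPO loss for one preference pair (winner y_w over loser y_l), assuming sequence log-probabilities under the policy and a frozen reference model are available; `dpo_loss` and `beta` follow the usual formulation but the names are illustrative:

```python
import math

# DPO loss: -log sigmoid(beta * margin), where the margin is how much more
# the policy has increased the winner's log-probability than the loser's,
# both measured relative to the frozen reference model.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically, -log(sigmoid(m)) = log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; widening the winner's margin drives the loss toward zero, all without sampling from the current policy (hence off-policy).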
RRHF: Rank Responses to Align Human Feedback—a method that aligns models by ranking candidate responses and optimizing their probabilities accordingly
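A sketch of RRHF's ranking term under the common formulation: for every pair of candidate responses where the reward model prefers one over the other, the model is penalized if the worse candidate's (length-normalized) log-probability exceeds the better one's. `cands` as a list of `(avg_logp, reward)` tuples is an illustrative interface:

```python
# RRHF ranking loss over a candidate set: sum of hinge penalties
# max(0, lp_worse - lp_better) across all misordered pairs, where lp is a
# length-normalized sequence log-probability and rewards define the ranking.
def rrhf_rank_loss(cands):
    loss = 0.0
    for lp_i, r_i in cands:
        for lp_j, r_j in cands:
            if r_i < r_j:
                # Penalize only when the lower-reward response is scored
                # at least as high as the higher-reward one.
                loss += max(0.0, lp_i - lp_j)
    return loss
```

The loss is zero whenever the model's likelihoods already agree with the reward ranking; RRHF typically adds an SFT-style term on the best candidate alongside this ranking term.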