ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

📝 Paper Summary

Memory-Efficient Fine-Tuning, Zeroth-Order Optimization, Mathematical Reasoning

ESSAM integrates Sharpness-Aware Maximization into Evolution Strategies to fine-tune LLMs for mathematical reasoning, achieving performance comparable to gradient-based RL while using only inference-level GPU memory.

Core Problem

Reinforcement Learning (RL) fine-tuning for LLMs requires prohibitive GPU memory due to gradient and optimizer state storage, while existing memory-efficient Evolution Strategies (ES) suffer from poor generalization on complex tasks.

Why it matters:

Fine-tuning an 8B model with PPO requires ~314 GB of GPU memory, making it inaccessible to researchers with limited resources
Standard zeroth-order methods (ES) tend to converge to sharp minima in the loss landscape, leading to brittle solutions that fail on unseen mathematical problems
Bridging the gap between memory efficiency and reasoning performance is critical for democratizing LLM alignment

Concrete Example: When fine-tuning LLaMA-3.1-8B on GSM8K, PPO consumes hundreds of gigabytes of GPU memory (about 314 GB for an 8B model), while standard ES and ESSAM both fit within an inference-level footprint. Averaged across the seven evaluated models, standard ES reaches 75.97% accuracy, whereas ESSAM reaches 78.27%, slightly above PPO's 77.72%.

Key Novelty

Evolution Strategies with Sharpness-Aware Maximization (ESSAM)

Introduces a 'look-ahead' step to zeroth-order optimization: instead of updating parameters directly based on current rewards, the method first perturbs parameters towards a 'sharpness-aware' neighborhood
Performs a two-stage evaluation: first to find the direction of the flat region (SAM update), and second to estimate the gradient at that robust location for the final update
Uses strictly forward passes (generation and scoring) to estimate updates, avoiding backpropagation and gradient storage entirely

Architecture

Conceptual workflow of ESSAM, contrasting its update mechanism with that of standard ES.

Evaluation Highlights

Achieves 78.27% average accuracy across 7 models on GSM8K, outperforming standard Evolution Strategies (75.97%) and PPO (77.72%)
Reduces GPU memory usage by 18x compared to PPO and 10x compared to GRPO on average, maintaining constant inference-level memory footprint
Matches the performance of GRPO on the Qwen-2.5-7B-Instruct model (ESSAM: 92.57% vs GRPO: 92.70%) while running on significantly less hardware

Breakthrough Assessment

8/10

Significantly closes the performance gap between memory-efficient zeroth-order methods and resource-heavy RL. The 18x memory reduction while matching PPO is a major practical enabler.

⚙️ Technical Details

Problem Definition

Setting: Full-parameter fine-tuning of Large Language Models to maximize a non-differentiable reward function

Inputs: Pre-trained LLM parameters, Mathematical questions (GSM8K)

Outputs: Fine-tuned LLM parameters maximizing reasoning accuracy

Pipeline Flow

Perturbation Stage: Generate N noise vectors -> Evaluate rewards -> Aggregate to find neighborhood direction
SAM Step: Perturb model to neighborhood point (theta_SAM)
Evaluation Stage: Generate N new noise vectors at theta_SAM -> Evaluate rewards
Update Step: Aggregate second-stage results -> Update original parameters
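
The four pipeline steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: `es_direction`, `essam_step`, and `toy_reward` are hypothetical names, rewards are z-scored before aggregation, the 1/sigma scale of the usual ES estimator is folded into the learning rate, and the SAM step is assumed to move toward lower reward (mirroring the worst-case perturbation of gradient-based SAM). The summary does not confirm any of these choices, and the default hyperparameter values are arbitrary.

```python
import math
import random


def es_direction(theta, reward_fn, n, sigma, rng):
    """Reward-weighted average of Gaussian noise vectors (a zeroth-order ascent direction)."""
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(n)]
    rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises]
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) or 1.0
    weights = [(r - mean) / std for r in rewards]  # z-scoring is an assumption
    return [sum(w * eps[j] for w, eps in zip(weights, noises)) / n
            for j in range(len(theta))]


def essam_step(theta, reward_fn, rng, n=16, sigma=0.1, rho=0.05, lr=0.05):
    # Perturbation stage: N noise vectors around theta, aggregated by reward.
    g1 = es_direction(theta, reward_fn, n, sigma, rng)
    norm = math.sqrt(sum(g * g for g in g1)) or 1.0
    # SAM step: move a distance rho to the neighborhood point theta_sam.
    # Assumed sign: toward lower reward, mirroring SAM's worst-case perturbation.
    theta_sam = [t - rho * g / norm for t, g in zip(theta, g1)]
    # Evaluation stage: N fresh noise vectors around theta_sam.
    g2 = es_direction(theta_sam, reward_fn, n, sigma, rng)
    # Update step: apply the second-stage direction to the ORIGINAL parameters.
    return [t + lr * g for t, g in zip(theta, g2)]


def toy_reward(x):
    """Stand-in for 'generate answers and score them'; peaks at x = (1, 1, 1)."""
    return -sum((xi - 1.0) ** 2 for xi in x)


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
for _ in range(200):
    theta = essam_step(theta, toy_reward, rng)
```

Only reward values are used, so each step costs 2N evaluations of `reward_fn` and no backward pass, which is where the two-stage runtime overhead noted under Limitations comes from.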

System Modules

Perturbation Generator

Generate Gaussian noise vectors to create a population of perturbed models

Model or implementation: N/A (Gaussian Sampling)
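
One common way to keep a perturbation generator at inference-level memory, used in other zeroth-order work, is to store only an integer seed per population member and regenerate the noise on demand. The summary does not say whether ESSAM does this, so the sketch below is an assumption; `noise_from_seed` and `perturb` are hypothetical helpers.

```python
import random


def noise_from_seed(seed, dim):
    """Regenerate the same Gaussian noise vector from its seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def perturb(theta, seed, sigma):
    """Return theta + sigma * eps, where eps is determined entirely by the seed."""
    eps = noise_from_seed(seed, len(theta))
    return [t + sigma * e for t, e in zip(theta, eps)]


theta = [0.5, -0.25, 1.0]
candidate = perturb(theta, seed=7, sigma=0.1)
# The noise can be rebuilt later for aggregation without having stored it.
recovered_eps = [(c - t) / 0.1 for c, t in zip(candidate, theta)]
```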

Reward Evaluator

Execute model forward passes to generate answers and compute rewards

Model or implementation: Target LLM (e.g., Qwen-2.5, Llama-3)

SAM Updater

Compute the 'sharpness-aware' update direction by aggregating noise weighted by rewards

Model or implementation: N/A (Arithmetic)

Novel Architectural Elements

Integration of SAM's two-step 'look-ahead' mechanism into the zeroth-order ES update loop
Two-stage reward evaluation process explicitly designed to guide the search towards flatter minima in parameter space

Modeling

Base Model: Qwen-2.5 (0.5B, 1.5B, 3B, 7B) and Llama-3 (1B, 3B, 8B)

Training Method: Evolution Strategies with Sharpness-Aware Maximization (ESSAM)

Objective Functions:

Purpose: Maximize the expected reward of the model output.

Formally: Maximize E[R(theta + epsilon)] using zeroth-order estimation.

Purpose: Encourage flatter minima for better generalization.

Formally: Update theta using the zeroth-order gradient estimate computed at a perturbed neighborhood point theta_SAM.
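
The two objectives can be written out as follows. This is a sketch using the standard Gaussian-smoothing ES estimator, with the noise scale sigma made explicit; the unit direction d, the step size alpha, and the exact form of the SAM perturbation are assumptions, since the summary does not give the paper's equations.

```latex
% Smoothed objective and its zeroth-order (ES) gradient estimate
J(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ R(\theta + \sigma\epsilon) \right],
\qquad
\hat{g}(\theta) = \frac{1}{N\sigma} \sum_{i=1}^{N} R(\theta + \sigma\epsilon_i)\,\epsilon_i

% Look-ahead: d is a unit direction derived from the first-stage estimate \hat{g}(\theta)
\theta_{\mathrm{SAM}} = \theta + \rho\, d, \qquad \lVert d \rVert = 1

% Update of the original parameters with the estimate taken at \theta_{\mathrm{SAM}}
\theta \leftarrow \theta + \alpha\, \hat{g}(\theta_{\mathrm{SAM}})
```

Here rho corresponds to perturbation_size_rho and sigma to noise_std_sigma in the hyperparameter list.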

Adaptation: Full parameter fine-tuning

Training Data:

GSM8K training set
Standard train/test split; the training data is shuffled and consumed in mini-batches over multiple update steps

Key Hyperparameters:

perturbation_size_rho: Step size used to move parameters to the SAM neighborhood point (value not reported in this summary)
noise_std_sigma: Standard deviation of the Gaussian noise (value not reported in this summary)
mini_batch_size: Small mini-batches are used (size not reported in this summary)

Compute: Inference-level GPU memory usage. 18x less memory than PPO on average.
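
A back-of-the-envelope calculation shows why dropping gradients and optimizer states matters. The byte counts below are assumptions (bf16 weights and gradients, Adam with fp32 master weights and two fp32 moments), not figures from the paper, and the function names are hypothetical. The estimate covers a single model and ignores activations, the KV-cache, and the extra reference, reward, and value models that PPO pipelines typically hold, so it does not attempt to reproduce the ~314 GB figure.

```python
def weights_only_gb(n_params, bytes_per_param=2):
    """Memory for model weights alone (assumed bf16), in GB."""
    return n_params * bytes_per_param / 1e9


def adam_training_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights + gradients + Adam states under assumed mixed-precision training, in GB."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


N_PARAMS = 8_000_000_000  # an 8B-parameter model

inference_gb = weights_only_gb(N_PARAMS)   # 16.0
training_gb = adam_training_gb(N_PARAMS)   # 128.0
ratio = training_gb / inference_gb         # 8.0
```

Under these assumptions a single trainable copy already needs 8x the weight memory before any activations are counted, whereas a forward-only method stays near the weights-only figure.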

Comparison to Prior Work

vs. PPO/GRPO: ESSAM uses zeroth-order optimization (no gradients), requiring drastically less memory (18x less than PPO)
vs. Standard ES: ESSAM adds a SAM-inspired look-ahead step to find flatter minima, improving generalization on math tasks where ES typically fails
vs. MeZO [not cited in paper]: MeZO is a memory-efficient zeroth-order optimizer; ESSAM extends similar principles specifically with SAM for RL-fine-tuning contexts

Limitations

Runtime is approximately 2x that of GRPO/PPO per iteration due to two-stage evaluation (sampling twice)
Performance on very large models (>8B) or non-math domains is not explored
Requires efficient inference infrastructure (vLLM) to make the population-based sampling practical

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using Chain-of-Thought prompting

Benchmarks:

GSM8K (Grade School Math Word Problems)

Metrics:

Accuracy (Exact Match of final answer)
Statistical methodology: Not explicitly reported in the paper
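
Exact-match accuracy on the final answer can be sketched as below. GSM8K reference solutions end with a line of the form `#### <number>`; the extraction rule for model outputs (take the last number in the text) and the helper names are assumptions, not the paper's scoring code.

```python
import re

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number after '####' if present, else the last number in the text."""
    tail = text.split("####")[-1] if "####" in text else text
    matches = _NUMBER.findall(tail)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


def exact_match(prediction, reference):
    """True when both final answers parse and are numerically equal."""
    p = extract_final_number(prediction)
    g = extract_final_number(reference)
    return p is not None and g is not None and float(p) == float(g)


def accuracy(predictions, references):
    """Fraction of predictions whose final answer matches the reference."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, references))
    return hits / len(references)


preds = ["She has 3 + 4 = 7 apples. The answer is 7.", "I am not sure."]
golds = ["3 + 4 = 7\n#### 7", "2 * 6 = 12\n#### 12"]
score = accuracy(preds, golds)
```

Because this score is a 0/1 signal computed from generated text, it is non-differentiable, which is the setting described under Problem Definition.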

Key Results

ESSAM consistently outperforms standard ES and matches or exceeds RL baselines across various model sizes.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Average Accuracy, % (All Models)	75.97 (Standard ES)	78.27	+2.30
GSM8K	Average Accuracy, % (All Models)	77.72 (PPO)	78.27	+0.55
GSM8K	Accuracy, % (Qwen-2.5-7B-Instruct)	Not reported (PPO; ESSAM described as 'outperforming PPO')	92.57	Positive (qualitative)
GSM8K	Accuracy, %	Not reported (implied lower than ESSAM)	78.92	Positive (qualitative)

Experiment Figures

Training reward curves comparing ESSAM and Standard ES.

Bar chart comparing GPU memory usage of ESSAM, PPO, and GRPO across different model sizes.

Main Takeaways

ESSAM achieves RL-level performance (comparable to PPO/GRPO) using only inference-level memory, democratizing full fine-tuning.
The integration of SAM significantly improves generalization over standard ES, as evidenced by higher test accuracy despite similar training reward curves.
Memory savings are substantial (18x vs PPO), scaling favorably with model size since no gradients or optimizer states are stored.
Convergence is faster in terms of iterations compared to ES, though per-iteration time is higher due to double sampling.

📚 Prerequisite Knowledge

Prerequisites

Evolution Strategies (ES)
Reinforcement Learning (RL)
Sharpness-Aware Maximization (SAM)

Key Terms

ES: Evolution Strategies—a family of optimization algorithms that use a population of candidate solutions (perturbed models) to estimate updates without calculating gradients

SAM: Sharpness-Aware Maximization—an optimization technique that seeks parameters lying in 'flat' minima (low loss neighborhoods) to improve generalization

Zeroth-order optimization: Optimization methods that rely only on function values (forward passes) rather than gradient information (backpropagation)

PPO: Proximal Policy Optimization—a gradient-based reinforcement learning algorithm that stabilizes training using a clipped surrogate objective

GRPO: Group Relative Policy Optimization—a variant of PPO that normalizes advantages within a group of outputs to reduce variance and remove the need for a large value network
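
The group normalization in GRPO can be illustrated in a few lines. This is a sketch of the advantage computation only, assuming the population standard deviation and a small epsilon for numerical stability; `group_relative_advantages` is a hypothetical name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four sampled answers to the same question (1 = correct, 0 = wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, with the group mean acting as the baseline in place of a learned value network.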

KV-cache: Key-Value cache—storage of computed attention representations during LLM inference to speed up token generation

GSM8K: Grade School Math 8K—a benchmark dataset of 8.5k high quality linguistically diverse grade school math word problems