Faster sorting algorithms discovered using deep reinforcement learning

📝 Paper Summary

Program Synthesis · Algorithm Optimization · Deep Reinforcement Learning

AlphaDev treats algorithm discovery as a single-player game, using reinforcement learning to generate low-level assembly code that outperforms human-optimized sorting and hashing benchmarks.

Core Problem

Fundamental algorithms like sorting are already highly optimized by human experts, making it extremely difficult to find further efficiency gains using traditional compilation or manual tuning.

Why it matters:

Sorting and hashing are used trillions of times daily; even small efficiency gains accumulate into massive global computational savings.
Human intuition has reached a bottleneck in optimizing these low-level routines, and traditional superoptimization methods struggle with the combinatorial search space of assembly instructions.

Concrete Example: A standard Sort-3 implementation based on a sorting network contains a step that computes min(A, B, C). AlphaDev discovered a 'swap move': because an earlier comparator already guarantees B ≤ C, computing min(A, B) is sufficient, which saves one instruction.
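The identity behind the swap move can be checked exhaustively on a small range of values. The following is a minimal sketch in Python rather than the x86 assembly the paper works with; the function names are hypothetical and chosen only for illustration.

```python
from itertools import product

def min_three(a, b, c):
    """Original form: minimum over all three values."""
    return min(a, b, c)

def min_two(a, b):
    """Reduced form: only valid when b <= c is already guaranteed."""
    return min(a, b)

def swap_move_holds(values):
    """Check min(a, b, c) == min(a, b) for every triple with b <= c."""
    return all(
        min_three(a, b, c) == min_two(a, b)
        for a, b, c in product(values, repeat=3)
        if b <= c
    )

print(swap_move_holds(range(-3, 4)))
```

The precondition matters: for a triple such as (5, 2, 1), where B > C, the two forms disagree, so the shortcut is only safe after a comparator that orders B and C.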

Key Novelty

AlphaDev (AssemblyGame RL Agent)

Formulates algorithm generation as a single-player game (AssemblyGame) where the agent builds a program instruction-by-instruction, receiving rewards for correctness and latency.
Uses a specialized neural architecture that combines a Transformer encoder for instruction sequences with a CPU state encoder to model register/memory dynamics during the search.
Optimizes for actual measured latency (not just proxy metrics like length) by running generated code on real hardware during the training loop.

Evaluation Highlights

Discovered fixed-sort algorithms (Sort 3, Sort 5) with fewer instructions than the state-of-the-art human benchmarks; the new routines were integrated into the LLVM standard C++ library.
Improved Variable Sort 5 (VarSort5) latency by approximately 5.8% (312k vs 331k ns) compared to human benchmarks.
Improved VarInt deserialization (Protocol Buffers) latency by roughly 3x compared to the human benchmark (97k vs 295k ns).

Breakthrough Assessment

10/10

Achieved the first change to the standard LLVM sort library in over a decade by automatically discovering algorithms that are objectively faster than human-optimized code.

⚙️ Technical Details

Problem Definition

Setting: Single-player game (AssemblyGame) defined by the state S_t = ⟨P_t, Z_t⟩, where P_t is the program generated so far and Z_t is the memory/register state.

Inputs: Current assembly program state and CPU memory/register configurations.

Outputs: Next assembly instruction (Opcode + Operands) to append to the program.

Pipeline Flow

AlphaDev Agent (Selects Assembly Instruction)
Program Builder (Appends instruction to current P_t)
Correctness/Latency Evaluator (Computes Reward)
MCTS (Updates Policy/Value for next step)
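The build-and-evaluate loop above can be sketched with a toy environment. Everything here is an assumption made for illustration: the four-instruction register machine, the class name `ToyAssemblyGame`, and the use of program length as a stand-in for latency. The paper itself targets x86 assembly and rewards measured latency.

```python
from dataclasses import dataclass, field
from itertools import permutations

def execute(program, regs):
    """Run a toy program on a copy of the registers and return the result."""
    regs = list(regs)
    lt = gt = False  # flags set by the most recent "cmp"
    for op, a, b in program:
        if op == "mov":
            regs[a] = regs[b]
        elif op == "cmp":
            lt, gt = regs[a] < regs[b], regs[a] > regs[b]
        elif op == "cmovl" and lt:
            regs[a] = regs[b]
        elif op == "cmovg" and gt:
            regs[a] = regs[b]
    return regs

@dataclass
class ToyAssemblyGame:
    n: int = 2                    # number of values to sort
    length_penalty: float = 0.01  # hypothetical proxy for latency
    program: list = field(default_factory=list)

    def test_inputs(self):
        # Register layout: n values to sort followed by one scratch register.
        return [list(p) + [0] for p in permutations(range(1, self.n + 1))]

    def step(self, instruction):
        """Append one instruction; return (state, reward)."""
        self.program.append(instruction)
        inputs = self.test_inputs()
        outputs = [execute(self.program, regs) for regs in inputs]
        n_correct = sum(
            out[: self.n] == sorted(inp[: self.n])
            for inp, out in zip(inputs, outputs)
        )
        correctness = n_correct / len(inputs)
        latency_term = -self.length_penalty * len(self.program)
        state = (tuple(self.program), outputs)  # (program so far, register states)
        return state, correctness + latency_term

game = ToyAssemblyGame()
sort2 = [("mov", 2, 0), ("cmp", 0, 1), ("cmovg", 0, 1), ("cmovg", 1, 2)]
for instruction in sort2:
    state, reward = game.step(instruction)
    print(instruction, round(reward, 2))
```

In this sketch the correctness term only reaches 1.0 once the final conditional move is appended, while the length penalty grows with every instruction, which is the tension the agent has to resolve.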

System Modules

AlphaDev Representation Network

Encodes the current assembly program and CPU state into embeddings for the policy/value heads

Model or implementation: Dual-component network (Transformer Encoder + MLP)

AlphaZero/MuZero

Guides the search for the next instruction using MCTS and learned policy/value predictions

Model or implementation: AlphaZero-based RL agent
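For readers unfamiliar with how AlphaZero-style search uses the policy and value predictions, the following sketch shows the standard PUCT selection rule. The constant `c_puct=1.25`, the dictionary layout, and the instruction names are assumptions for illustration; the summary does not state the exact search settings AlphaDev uses.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.25):
    """Mean value plus an exploration bonus weighted by the policy prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_action(children, c_puct=1.25):
    """Pick the child with the highest PUCT score.

    children maps an action to a dict with keys "q", "prior", "visits".
    """
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a]["q"],
            children[a]["prior"],
            parent_visits,
            children[a]["visits"],
            c_puct,
        ),
    )

# Hypothetical statistics for two candidate next instructions.
children = {
    "mov": {"q": 0.5, "prior": 0.2, "visits": 10},
    "cmp": {"q": 0.4, "prior": 0.7, "visits": 0},
}
print(select_action(children))
```

An unvisited action with a high prior receives a large bonus, so the search tries it before committing further to an already well-explored action.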

Novel Architectural Elements

Transformer-based instruction encoder adapted for assembly language (Opcode/Operand embedding)
CPU State Encoder (MLP) that inputs raw register/memory values to predict algorithm dynamics
Dual-head value function architecture predicting both 'correctness' and 'latency' separately to guide MCTS

Modeling

Base Model: Custom AlphaZero-based architecture with Transformer and MLP encoders

Training Method: AlphaZero / MuZero (Reinforcement Learning)

Objective Functions:

Purpose: Maximize cumulative reward combining correctness and latency.

Formally: R = Correctness + Latency Reward.

Purpose: Minimize prediction error for policy and value heads.

Formally: Standard AlphaZero loss (cross-entropy for policy, MSE for value).
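As a rough illustration of the second objective, the sketch below computes a cross-entropy policy term against the MCTS visit distribution plus a squared-error value term. It is a simplification under stated assumptions: it uses a single scalar value rather than separate correctness and latency heads, and it omits the weight regularization usually included in AlphaZero training. The function name is hypothetical.

```python
import math

def alphazero_loss(policy_logits, search_policy, value_pred, value_target):
    """Policy cross-entropy plus value squared error for one position."""
    # Numerically stable log-softmax over the policy logits.
    m = max(policy_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in policy_logits))
    log_probs = [x - log_z for x in policy_logits]
    # Cross-entropy between the search (target) policy and the network policy.
    policy_loss = -sum(p * lp for p, lp in zip(search_policy, log_probs))
    # Squared error between predicted and observed return.
    value_loss = (value_pred - value_target) ** 2
    return policy_loss + value_loss

# Uniform prediction over two actions against a uniform search policy.
print(alphazero_loss([0.0, 0.0], [0.5, 0.5], value_pred=1.0, value_target=1.0))
```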

Training Data:

Generated via self-play in the AssemblyGame environment

Key Hyperparameters:

batch_size: 1024 per TPU core
training_iterations: 1 million
TPU_cores: 16
TPU_actors: 512

Compute: Training on TPU v3 (up to 16 cores); actors run on TPU v4 (up to 512 actors). Worst-case training time: 2 days.

Comparison to Prior Work

vs. Stochastic Superoptimization: AlphaDev uses Deep RL (AlphaZero) to guide search rather than random MCMC sampling, exploring orders of magnitude fewer programs.
vs. Human Benchmarks: AlphaDev optimizes for measured latency and specific CPU architecture details, discovering non-intuitive moves (e.g., 'AlphaDev swap move') that humans missed.

Limitations

Search space grows exponentially; scaling to large sorts (e.g., Sort > 8) is computationally prohibitive.
Requires a definable correctness function (e.g., comparing to a ground truth sort), which limits applicability to verifiable domains.
Optimized for specific processor architectures (x86 in this paper); distinct retraining required for different instruction sets.

Reproducibility

The resulting algorithms (Sort 3, 4, 5) are integrated into LLVM (https://reviews.llvm.org/D118029). The paper details the RL environment (AssemblyGame), instruction set, and reward structure. Exact model weights are not provided, but the methodology is described in detail.

📊 Experiments & Results

Evaluation Setup

Optimization of fixed and variable length sorting algorithms and VarInt deserialization on x86 architecture.

Benchmarks:

Fixed Sort (Sort 3, 4, 5) (Sorting fixed-length sequences)
Variable Sort (VarSort 3, 4, 5) (Sorting variable-length sequences)
VarInt Deserialization (Protocol Buffer integer decoding)
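For context on the last benchmark, this is a minimal reference decoder for the base-128 varint encoding used by Protocol Buffers: each byte carries 7 payload bits, least-significant group first, and the high bit marks continuation. It illustrates the task only; it is not the optimized routine from the paper, and the function name is hypothetical.

```python
def decode_varint(data, offset=0):
    """Decode one unsigned varint from data; return (value, next_offset)."""
    value = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits are payload
        if (byte & 0x80) == 0:            # high bit clear: last byte
            return value, pos
        shift += 7
        if shift >= 70:                   # more than 10 bytes is invalid
            raise ValueError("varint too long")

value, next_offset = decode_varint(bytes([0x96, 0x01]))
print(value, next_offset)
```

The data-dependent loop and branch per byte are what make this routine a natural target for low-level optimization.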

Metrics:

Algorithm Length (number of assembly instructions)
Latency (nanoseconds, 5th percentile over 100 machines)
Statistical methodology: Confidence intervals for latency computed (95% CI for 5th percentile)
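One way to obtain a 5th-percentile latency with a 95% confidence interval is a percentile bootstrap, sketched below. This is an assumption about a reasonable procedure, not a description of the paper's exact statistical method, and the latency samples are synthetic numbers generated only for the example.

```python
import math
import random

def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in [0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def bootstrap_ci(samples, q=5, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    estimates = [
        percentile(rng.choices(samples, k=len(samples)), q)
        for _ in range(n_resamples)
    ]
    lower = percentile(estimates, 100 * alpha / 2)
    upper = percentile(estimates, 100 * (1 - alpha / 2))
    return lower, upper

# Synthetic per-machine latencies in ns (made-up values, not from the paper).
_rng = random.Random(42)
latencies = [_rng.gauss(1000.0, 50.0) for _ in range(100)]

p5 = percentile(latencies, 5)
low, high = bootstrap_ci(latencies)
print(f"p5 = {p5:.1f} ns, 95% CI = [{low:.1f}, {high:.1f}] ns")
```

A low percentile is a common choice for latency benchmarks because it is less sensitive to noise from other processes sharing the machine than the mean.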

Key Results

Benchmark	Metric	Human Baseline	AlphaDev	Δ
Fixed-length sorts: AlphaDev discovers algorithms with fewer instructions than the human benchmarks.
Sort 3	Instruction count	18	17	-1
Sort 5	Instruction count	46	42	-4
Variable-length sorts: when optimizing for measured latency, AlphaDev outperforms the human benchmarks.
VarSort3	Latency (ns)	246040	236498	-9542
VarSort5	Latency (ns)	331198	312079	-19119
VarInt deserialization: AlphaDev generalizes to a non-sorting domain.
VarInt	Latency (ns)	295358	97184	-198174
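The relative gains implied by the latency rows above follow from simple arithmetic. The short script below recomputes them from the table values; the helper names are hypothetical.

```python
# (baseline ns, AlphaDev ns) taken from the latency rows of the results table.
results = {
    "VarSort3": (246040, 236498),
    "VarSort5": (331198, 312079),
    "VarInt": (295358, 97184),
}

def relative_reduction(baseline, new):
    """Fraction of the baseline latency that was removed."""
    return (baseline - new) / baseline

def speedup(baseline, new):
    """How many times faster the new code is than the baseline."""
    return baseline / new

for name, (baseline, new) in results.items():
    print(
        f"{name}: {100 * relative_reduction(baseline, new):.1f}% lower latency, "
        f"{speedup(baseline, new):.2f}x speedup"
    )
```

This gives roughly a 3.9% reduction for VarSort3, 5.8% for VarSort5, and a speedup of about 3x for VarInt.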

Main Takeaways

AlphaDev discovered novel algorithmic primitives (Swap Move, Copy Move) that improve upon optimal sorting networks.
For variable sorts, AlphaDev discovered a fundamentally different branch structure (simplifying Sort 4 calls) compared to human implementations.
Reverse-engineered C++ implementations of AlphaDev's assembly code led to up to 70% improvements for specific sequence lengths in the LLVM library.
Generalizes to VarInt decoding, discovering a new 'assignment move' that combines operations.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (AlphaZero/MuZero)
Computer Architecture (Assembly, Registers, Latency)
Sorting Networks

Key Terms

AssemblyGame: A single-player game formulation where the player selects low-level CPU instructions to construct a correct and efficient algorithm

AlphaZero: A reinforcement learning algorithm that masters games through self-play using MCTS (Monte Carlo Tree Search) and a neural network guide

MCTS: Monte Carlo Tree Search, a search algorithm that chooses actions by simulating many possible future game states

LLVM: A collection of modular and reusable compiler and toolchain technologies; its standard C++ library is used by millions

Sorting Network: A comparison-based sorting algorithm where the sequence of comparisons is fixed and data-independent, typically implemented as branchless code

VarInt: Variable-width integer encoding used in Protocol Buffers to serialize integers efficiently

Superoptimization: The search for the optimal (e.g., shortest or fastest) instruction sequence that is equivalent to a given loop-free sequence of instructions

Opcode: The portion of a machine language instruction that specifies the operation to be performed (e.g., MOV, CMP)

Latency: The time a program takes to execute; in this paper, the measured running time of the generated code in nanoseconds
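To make the Sorting Network entry concrete, the sketch below sorts three values with a fixed, data-independent sequence of compare-exchange operations. It is written in Python for readability; real implementations are typically branchless assembly or C++, and the names here are hypothetical.

```python
from itertools import permutations

# Fixed comparator sequence for three inputs; each pair (i, j) has i < j.
SORT3_NETWORK = [(0, 1), (1, 2), (0, 1)]

def compare_exchange(values, i, j):
    """Place the smaller of values[i], values[j] at i and the larger at j."""
    lo, hi = min(values[i], values[j]), max(values[i], values[j])
    values[i], values[j] = lo, hi

def sort_with_network(values, network=SORT3_NETWORK):
    """Apply the comparators in order; the sequence never depends on the data."""
    out = list(values)
    for i, j in network:
        compare_exchange(out, i, j)
    return out

# Exhaustive check over all orderings of three distinct values.
print(all(sort_with_network(p) == [1, 2, 3] for p in permutations([1, 2, 3])))
```

Because the comparator sequence is fixed, correctness can be verified by testing every input ordering, which is also what makes a correctness reward cheap to compute for small fixed-length sorts.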