← Back to Paper List

Faster sorting algorithms discovered using deep reinforcement learning

D. Mankowitz, Andrea Michi, A. Zhernov, Marco Gelmi, M. Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alexa Ahern, Thomas Köppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, R. Tung, Minjae Hwang, taylan. cemgil, M. Barekatain, Yujia Li, Amol Mandhane, T. Hubert, Julian Schrittwieser, D. Hassabis, Pushmeet Kohli, Martin A. Riedmiller, O. Vinyals, David Silver
DeepMind, Google
Nature (2023)
RL Reasoning Benchmark

📝 Paper Summary

Program Synthesis Algorithm Optimization Deep Reinforcement Learning
AlphaDev treats algorithm discovery as a single-player game, using reinforcement learning to generate low-level assembly code that outperforms human-optimized sorting and hashing benchmarks.
Core Problem
Fundamental algorithms like sorting are already highly optimized by human experts, making it extremely difficult to find further efficiency gains using traditional compilation or manual tuning.
Why it matters:
  • Sorting and hashing are used trillions of times daily; even small efficiency gains accumulate into massive global computational savings.
  • Human intuition has reached a bottleneck in optimizing these low-level routines, and traditional superoptimization methods struggle with the combinatorial search space of assembly instructions.
Concrete Example: A standard Sort-3 implementation might use a comparator network that sorts three elements but includes redundant instructions. AlphaDev discovered a 'swap move' sequence that saves one instruction by recognizing that if B ≤ C is guaranteed, only min(A,B) is needed instead of min(A,B,C).
Key Novelty
AlphaDev (AssemblyGame RL Agent)
  • Formulates algorithm generation as a single-player game (AssemblyGame) where the agent builds a program instruction-by-instruction, receiving rewards for correctness and latency.
  • Uses a specialized neural architecture that combines a Transformer encoder for instruction sequences with a CPU state encoder to model register/memory dynamics during the search.
  • Optimizes for actual measured latency (not just proxy metrics like length) by running generated code on real hardware during the training loop.
Evaluation Highlights
  • Discovered fixed-sort algorithms (Sort 3, Sort 5) with fewer instructions than optimal human benchmarks, integrated into the LLVM C++ standard library.
  • Improved Variable Sort 5 (VarSort5) latency by approximately 5.7% (312k vs 331k ns) compared to human benchmarks.
  • Improved VarInt deserialization (Protocol Buffers) latency by roughly 3x compared to the human benchmark (97k vs 295k ns).
Breakthrough Assessment
10/10
Achieved the first change to the standard LLVM sort library in over a decade by automatically discovering algorithms that are objectively faster than human-optimized code.
×