← Back to Paper List

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

M Liu, L Jiang, Y Liang, SS Du, Y Choi, T Althoff…
University of Washington, Stanford University
arXiv, 6/2025 (2025)
RL Reasoning Agent Benchmark

📝 Paper Summary

LLM Safety Alignment Red Teaming Multi-Agent Reinforcement Learning (MARL)
Self-RedTeam continuously co-evolves a single LLM acting as both attacker and defender via online reinforcement learning, theoretically guaranteeing safety at Nash Equilibrium.
Core Problem
Conventional safety alignment relies on disjoint phases where attackers exploit static models and defenders patch known exploits, creating a reactive cat-and-mouse game where defenders perpetually lag behind new threats.
Why it matters:
  • Static attack datasets quickly become obsolete as models learn to defend against specific patterns but remain vulnerable to novel variations
  • Training attackers and defenders in isolation leads to overfitting, preventing the development of robust, generalizable safety mechanisms
  • Public failures of aligned models (e.g., causing economic damage) demonstrate that current disjoint methods fail to provide safety guarantees
Concrete Example: An attacker trained against a static defender might overfit to generating 'disinformation campaign' prompts because they work, failing to explore other vectors. Meanwhile, a defender trained on static data might refuse 'how to kill a process' (benign) because it resembles 'how to kill a person' (harmful), lacking the nuance developed through dynamic interaction.
Key Novelty
Zero-Sum Self-Play Safety Game with Hidden Reasoning
  • Formulate safety alignment as a two-player zero-sum game where one model alternates between generating attacks and defending against them, optimizing toward a Nash Equilibrium where safety is theoretically guaranteed
  • Introduce 'Hidden Chain-of-Thought' where agents reason privately about their strategy (e.g., how to bypass a filter or how to detect a trap) before generating visible outputs, preventing the opponent from seeing the strategy
Architecture
Architecture Figure Figure 1
The Self-RedTeam workflow where a single model alternates roles. It illustrates the 'Think before act' mechanism with hidden thoughts and the interaction loop judged by a reward model.
Evaluation Highlights
  • Reduces Attack Success Rate (ASR) by up to 95% across 12 safety benchmarks compared to standard RLHF-aligned models
  • Discovering 17.8% more diverse attacks (measured by SBERT similarity) compared to attackers trained against static defenders
  • Achieves 38.08% length-controlled winrate on AlpacaEval-2, outperforming defender-only baselines (35.50%) and showing safety gains don't degrade general capabilities
Breakthrough Assessment
8/10
Significant advancement by successfully applying online MARL to LLM safety with theoretical backing. Moves beyond static datasets to dynamic co-evolution, showing strong empirical gains in both safety and attack diversity.
×