Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
Princeton University, Microsoft Research NYC
arXiv (2025)
RL Reasoning

📝 Paper Summary

Topics: Reinforcement Learning for Language Models · Exploration in RL · Post-training Optimization
Using elliptical bonuses derived from pre-trained model hidden states significantly improves diversity and reasoning performance in both inference-time selection and RL post-training.
Core Problem
Current RL post-training methods often fail to discover novel behaviors, instead merely sharpening existing ones, and they struggle when the base model assigns low probability to correct answers.
Why it matters:
  • Existing RL recipes may simply amplify behaviors the base model can already execute rather than unlocking new capabilities
  • Data scale and quality are becoming bottlenecks in complex domains where current interventions fall short of eliciting desired behavior
  • Without explicit exploration, models suffer from 'diversity collapse' during RL, degrading performance on harder tasks where diverse attempts are needed
Concrete Example: In math reasoning, a standard RL-trained model might converge to a single proof strategy. If that strategy is flawed for a specific problem type, the model consistently fails. In contrast, an exploration-guided model maintains diverse proof strategies, increasing the chance of finding a correct solution.
Key Novelty
Representation-Based Elliptical Bonuses (RepExp)
  • Adapt linear bandit theory to language models by treating hidden state representations as feature vectors for calculating novelty
  • Compute an 'elliptical bonus' that rewards responses whose representations are dissimilar to those previously selected or generated
  • Apply this bonus in two settings: selecting diverse coresets of responses at inference time, and augmenting the reward function during RL post-training
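The inference-time selection step above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes each response has been reduced to a single pooled hidden-state vector, greedily picks the response with the largest elliptical bonus sqrt(φᵀ Σ⁻¹ φ), and folds each pick into the covariance via a Sherman-Morrison rank-1 update. The function name and pooling choice are hypothetical.

```python
import numpy as np

def elliptical_coreset(features, k, lam=1.0):
    """Greedily select k responses whose hidden-state features are most
    novel under an elliptical bonus (a sketch of the paper's idea).

    features: (n, d) array, one pooled hidden-state vector per response.
    lam: ridge regularizer for the initial covariance lam * I.
    Returns the indices of the k selected responses.
    """
    n, d = features.shape
    cov_inv = np.eye(d) / lam          # (lam * I)^{-1}, updated incrementally
    selected = []
    for _ in range(k):
        # Elliptical bonus sqrt(phi^T Sigma^{-1} phi) for every candidate.
        bonuses = np.sqrt(np.einsum("nd,de,ne->n", features, cov_inv, features))
        bonuses[selected] = -np.inf     # never re-pick a chosen response
        i = int(np.argmax(bonuses))
        selected.append(i)
        # Sherman-Morrison rank-1 update: Sigma <- Sigma + phi phi^T.
        phi = features[i]
        v = cov_inv @ phi
        cov_inv -= np.outer(v, v) / (1.0 + phi @ v)
    return selected
```

In the RL post-training setting, the same per-response bonus would instead be scaled and added to the task reward, so rollouts whose representations are far from those already visited receive extra credit.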
Evaluation Highlights
  • +50% improvement in verifier efficiency on Qwen-2.5-14b-Instruct across GSM8K, MATH, MBPP+, and Game-of-24 using inference-time selection
  • Post-trained Qwen-2.5-7b-Instruct matches the pass@256 performance of standard GRPO using only pass@80 (a 3x improvement in test-time sample efficiency) on AIME 2024
  • Eliminates 'diversity collapse' in RL post-training, maintaining high pass@k rates for large k where standard RL typically degrades below the base model
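The pass@k numbers above are the standard way to measure whether diversity survives training. For reference, this is the usual unbiased estimator (from the Codex evaluation literature, not specific to this paper): given n sampled generations of which c are correct, it gives the probability that at least one of k draws without replacement is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generations sampled, c: number correct, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Diversity collapse shows up here as pass@k flattening (or dropping below the base model) as k grows, which is exactly the regime the elliptical bonus is reported to preserve.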
Breakthrough Assessment
8/10
Offers a principled, scalable solution to the exploration problem in LLMs. The 3x efficiency gain in post-training and elimination of diversity collapse are significant practical advances.