Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
Princeton University, Microsoft Research NYC
arXiv (2025)
RL Reasoning

📝 Paper Summary

Topics: Reinforcement Learning for Language Models · Exploration in RL · Post-training Optimization
Using elliptical bonuses derived from pre-trained model hidden states significantly improves diversity and reasoning performance in both inference-time selection and RL post-training.
Core Problem
Current RL post-training methods often fail to discover novel behaviors, instead merely sharpening existing ones, and they struggle when the base model assigns low probability to correct answers.
Why it matters:
  • Existing RL recipes may simply amplify behaviors the base model can already execute rather than unlocking new capabilities
  • Data scale and quality are becoming bottlenecks in complex domains where current interventions fall short of eliciting desired behavior
  • Without explicit exploration, models suffer from 'diversity collapse' during RL, degrading performance on harder tasks where diverse attempts are needed
Concrete Example: In math reasoning, a standard RL-trained model might converge to a single proof strategy. If that strategy is flawed for a specific problem type, the model consistently fails. In contrast, an exploration-guided model maintains diverse proof strategies, increasing the chance of finding a correct solution.
Key Novelty
Representation-Based Elliptical Bonuses (RepExp)
  • Adapt linear bandit theory to language models by treating hidden state representations as feature vectors for calculating novelty
  • Compute an 'elliptical bonus' that rewards responses whose representations are dissimilar to those previously selected or generated
  • Apply this bonus in two settings: selecting diverse coresets of responses at inference time, and augmenting the reward function during RL post-training
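The inference-time selection step above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes each response has been reduced to a single pooled hidden-state vector, greedily picks the response with the largest elliptical bonus sqrt(φᵀ Σ⁻¹ φ), and folds each pick into the covariance via a Sherman-Morrison rank-1 update. The function name and pooling choice are hypothetical.

```python
import numpy as np

def elliptical_coreset(features, k, lam=1.0):
    """Greedily select k responses whose hidden-state features are most
    novel under an elliptical bonus (a sketch of the paper's idea).

    features: (n, d) array, one pooled hidden-state vector per response.
    lam: ridge regularizer for the initial covariance lam * I.
    Returns the indices of the k selected responses.
    """
    n, d = features.shape
    cov_inv = np.eye(d) / lam          # (lam * I)^{-1}, updated incrementally
    selected = []
    for _ in range(k):
        # Elliptical bonus sqrt(phi^T Sigma^{-1} phi) for every candidate.
        bonuses = np.sqrt(np.einsum("nd,de,ne->n", features, cov_inv, features))
        bonuses[selected] = -np.inf     # never re-pick a chosen response
        i = int(np.argmax(bonuses))
        selected.append(i)
        # Sherman-Morrison rank-1 update: Sigma <- Sigma + phi phi^T.
        phi = features[i]
        v = cov_inv @ phi
        cov_inv -= np.outer(v, v) / (1.0 + phi @ v)
    return selected
```

In the RL post-training setting, the same per-response bonus would instead be scaled and added to the task reward, so rollouts whose representations are far from those already visited receive extra credit.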
Evaluation Highlights
  • +50% improvement in verifier efficiency on Qwen-2.5-14b-Instruct across GSM8K, MATH, MBPP+, and Game-of-24 using inference-time selection
  • Post-trained Qwen-2.5-7b-Instruct matches the pass@256 performance of standard GRPO using only pass@80 (a 3x improvement in test-time sample efficiency) on AIME 2024
  • Eliminates 'diversity collapse' in RL post-training, maintaining high pass@k rates for large k where standard RL typically degrades below the base model
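The pass@k numbers above are the standard way to measure whether diversity survives training. For reference, this is the usual unbiased estimator (from the Codex evaluation literature, not specific to this paper): given n sampled generations of which c are correct, it gives the probability that at least one of k draws without replacement is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generations sampled, c: number correct, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Diversity collapse shows up here as pass@k flattening (or dropping below the base model) as k grows, which is exactly the regime the elliptical bonus is reported to preserve.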
Breakthrough Assessment
8/10
Offers a principled, scalable solution to the exploration problem in LLMs. The 3x efficiency gain in post-training and elimination of diversity collapse are significant practical advances.