Supervised Pretraining Can Learn In-Context Reinforcement Learning

📝 Paper Summary

In-Context Learning, Reinforcement Learning

Training a transformer to simply predict optimal actions from interaction histories enables it to act as an effective in-context RL algorithm that performs exploration and improves upon its training data.

Core Problem

Standard RL methods often require expensive retraining for new tasks, while existing offline RL methods struggle to improve beyond their datasets or handle online exploration naturally.

Why it matters:

Real-world agents need to adapt to new environments instantly (few-shot) without parameter updates
Current supervised transformers (like Decision Transformer) generally cannot outperform the behavior policy in their training data
Posterior Sampling is a powerful theoretical RL algorithm but is computationally intractable to implement directly

Concrete Example: In a linear bandit problem, a standard supervised model mimics the suboptimal algorithm used to generate the data. In contrast, DPT learns the underlying structure (linearity) and performs efficient exploration to find the optimal arm, achieving lower regret than the algorithm that generated its training data.

Key Novelty

Decision-Pretrained Transformer (DPT)

Pretrains a transformer using supervised learning to predict the *optimal* action given a query state and a context of suboptimal interactions
Demonstrates that this simple objective leads to emergent exploration and posterior sampling behavior at test time, without explicit exploration training

Architecture

Overview of the Decision-Pretrained Transformer (DPT) framework. It shows the pretraining phase where the model learns to predict optimal actions from datasets, and the evaluation phase where it interacts with a new environment to collect data and refine its policy in-context.
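The evaluation phase described above can be sketched as a loop in which the only thing that changes between steps is the context; no parameters are updated. This is a minimal illustration, not the paper's implementation: `stand_in_model`, `deploy_online`, and `bernoulli_bandit_step` are hypothetical names, the stand-in replaces a real transformer forward pass with a simple reward-averaging heuristic, and appending to the context after every step is a simplifying assumption.

```python
import random

def stand_in_model(query_state, context, num_actions):
    """Hypothetical placeholder for a pretrained DPT: returns action probabilities.

    A real DPT would compute this distribution with a transformer forward pass
    over (context, query_state). Here, mass is proportional to a smoothed mean reward.
    """
    totals = [1.0] * num_actions  # one pseudo-reward per action keeps every probability positive
    counts = [1] * num_actions
    for (_s, action, _s_next, reward) in context:
        totals[action] += reward
        counts[action] += 1
    means = [t / c for t, c in zip(totals, counts)]
    normalizer = sum(means)
    return [m / normalizer for m in means]

def bernoulli_bandit_step(state, action, rng, means=(0.2, 0.8)):
    """Toy environment (assumption): a 2-armed Bernoulli bandit with a single state."""
    reward = 1.0 if rng.random() < means[action] else 0.0
    return state, reward

def deploy_online(env_step, num_actions, horizon, rng):
    """Act by sampling from the model's predicted distribution and grow the context."""
    context = []  # starts empty: the agent knows nothing about the new task
    state = 0
    for _ in range(horizon):
        probs = stand_in_model(state, context, num_actions)
        action = rng.choices(range(num_actions), weights=probs)[0]
        next_state, reward = env_step(state, action, rng)
        context.append((state, action, next_state, reward))  # (s, a, s', r)
        state = next_state
    return context

context = deploy_online(bernoulli_bandit_step, num_actions=2, horizon=50, rng=random.Random(0))
print(len(context))  # 50
```

Sampling from the predicted distribution, rather than taking its argmax, is what lets the agent keep exploring while the context is still uninformative.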

Evaluation Highlights

DPT achieves sub-linear regret on linear bandit tasks even when pretrained on data from a uniform sampling policy (which has linear regret)
Matches the performance of LinUCB (a confidence-bound algorithm designed specifically for linear bandits) on linear bandits with unknown representations, and outperforms Algorithm Distillation
In Dark Room MDPs, DPT effectively explores to find an unseen goal state and generalizes to new map layouts not seen during pretraining

Breakthrough Assessment

8/10

Provides a theoretical and empirical link between supervised pretraining and Bayesian posterior sampling, showing that transformers can learn to explore and to improve over their training data. This addresses a major criticism of prior work such as Decision Transformer.

⚙️ Technical Details

Problem Definition

Setting: In-context Reinforcement Learning via Supervised Pretraining

Inputs: A query state s_query and a context dataset D = {(s, a, s', r)} of interactions

Outputs: Predicted optimal action a*

Pipeline Flow

Context Encoder (processes interaction history)
Query Decoder (predicts action for new state)

System Modules

Transformer Backbone

Processes the sequence of transitions and the query state to predict the optimal action

Model or implementation: GPT-2 architecture (causal transformer)

Novel Architectural Elements

The architecture is standard GPT-2; the novelty lies in the training objective (predicting optimal labels from suboptimal contexts) rather than the pipeline structure.

Modeling

Base Model: GPT-2

Training Method: Supervised Learning (Log-likelihood maximization)

Objective Functions:

Purpose: Maximize the likelihood of the optimal action given the context and query state.

Formally: Minimize -log M_theta(a* | s_query, D)
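For a discrete action space, the objective above is an ordinary cross-entropy loss with the optimal action as the label. The sketch below assumes the model emits one unnormalized score (logit) per action; the function names are illustrative and not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def optimal_action_nll(logits, optimal_action):
    """-log M_theta(a* | s_query, D), where `logits` are the model's action scores."""
    return -log_softmax(logits)[optimal_action]

# A model that is indifferent between 4 actions pays log(4) regardless of the label.
print(optimal_action_nll([0.0, 0.0, 0.0, 0.0], 2))
```

The loss shrinks as the model concentrates probability on the labeled optimal action, which is all the pretraining signal requires.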

Training Data:

Generated synthetically. For each training example:
Sample task tau ~ T_pre
Sample dataset D ~ D_pre(.; tau) (e.g., random rollouts, expert demos)
Sample query state s_query and optimal label a* ~ pi*_tau(. | s_query)
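The sampling steps above can be sketched for a toy Bernoulli bandit. Everything specific here is an assumption made for illustration: the uniform prior over arm means, the uniformly random behavior policy, the dummy state `0`, and the name `sample_pretraining_example`.

```python
import random

def sample_pretraining_example(num_arms, context_size, rng):
    """Build one (context, query state, optimal label) training example."""
    # Task tau ~ T_pre: Bernoulli arm means (assumption: uniform on [0, 1]).
    means = [rng.random() for _ in range(num_arms)]

    # Dataset D ~ D_pre(.; tau): transitions from a uniformly random behavior policy.
    context = []
    for _ in range(context_size):
        arm = rng.randrange(num_arms)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        context.append((0, arm, 0, reward))  # (s, a, s', r); bandits have one dummy state

    # Query state and optimal label: the arm with the highest true mean.
    query_state = 0
    optimal_action = max(range(num_arms), key=lambda a: means[a])
    return {
        "context": context,
        "query_state": query_state,
        "optimal_action": optimal_action,
        "means": means,
    }

example = sample_pretraining_example(num_arms=5, context_size=20, rng=random.Random(0))
print(example["optimal_action"], len(example["context"]))
```

Note that the label comes from the true task parameters, not from the context, so the context can be arbitrarily suboptimal while the target stays optimal.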

Key Hyperparameters:

learning_rate: Not reported in the main text
batch_size: Not reported in the main text

Compute: Not reported in the main text

Comparison to Prior Work

vs. Algorithm Distillation: DPT predicts optimal actions directly rather than cloning the learning steps of another algorithm; DPT can outperform the algorithm generating its data
vs. Decision Transformer: DPT does not require conditioning on return; DT is often limited to the best performance in the dataset, whereas DPT can improve upon it
vs. Posterior Sampling: DPT approximates PS efficiently in a single forward pass without needing explicit Bayesian updates or tractable posteriors

Limitations

Requires access to optimal action labels during pretraining (assumes solved tasks at training time)
Generalization is limited by the diversity of the pretraining task distribution
Context length of the transformer limits the horizon of in-context learning (long-term memory)

Reproducibility

No code URL provided. Method is described algorithmically (Algorithm 1). Hyperparameters like learning rate and model size are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluated on multi-armed bandits (Bernoulli, Linear) and Markov Decision Processes (Dark Room, Miniworld)

Benchmarks:

Multi-Armed Bandits (Online exploration and regret minimization)
Dark Room MDP (Gridworld navigation with unknown goal location)
Miniworld (3D visual navigation from image observations)

Metrics:

Cumulative Regret
Success Rate (reaching goal)
Statistical methodology: Plots appear to show shaded variability bands (likely standard deviation), but the text does not state this explicitly

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Linear Bandits (Unknown Representation)	Cumulative Regret	Linear Growth (approx 100 at step 100 based on plot trend)	Sublinear Growth (approx 20 at step 100 based on plot trend)	-80 (approx)
Dark Room MDP	Success Rate	0.15	0.95	+0.80
Dark Room MDP	Steps to Goal	Slower convergence	Faster convergence	Positive qualitative improvement

Experiment Figures

Performance of DPT on Dark Room MDPs in three settings: Online Exploration, Offline Conservatism, and Generalization.

Regret plots for Bandit problems (Bernoulli and Linear).

Main Takeaways

DPT effectively implements posterior sampling in-context, allowing it to solve exploration problems efficiently.
The model generalizes to new reward functions, dynamics, and map layouts not seen during training.
It improves over the behavior policy used to collect the pretraining data (e.g., learning efficient exploration from random data).
It acts conservatively in offline settings and explores in online settings, switching automatically based on the context.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, bandits, regret)
Transformer architectures (attention mechanisms)
Bayesian inference (posterior sampling, Thompson sampling)

Key Terms

Posterior Sampling: A Bayesian RL strategy where an agent samples a hypothesis (a model of the world) from a posterior distribution and acts optimally according to that hypothesis
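In the simplest case, a Bernoulli bandit with Beta priors, posterior sampling is tractable and is known as Thompson sampling. The sketch below shows that explicit version for reference; it is not DPT itself, and the Beta(1, 1) prior and the function name are choices made for this example.

```python
import random

def thompson_sampling(means, horizon, rng):
    """Run Thompson sampling on a Bernoulli bandit; return pull counts per arm."""
    num_arms = len(means)
    successes = [0] * num_arms
    failures = [0] * num_arms
    pulls = [0] * num_arms
    for _ in range(horizon):
        # Sample one hypothesis (a mean for each arm) from the Beta posterior.
        sampled = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(num_arms)
        ]
        # Act optimally with respect to the sampled hypothesis.
        arm = max(range(num_arms), key=lambda a: sampled[a])
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling(means=[0.1, 0.9], horizon=500, rng=random.Random(0))
print(pulls)
```

As the posterior sharpens, sampled hypotheses increasingly agree on the best arm, so exploration fades without any explicit schedule.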

Regret: The difference between the total reward an optimal policy would have achieved and the reward actually collected by the agent
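For a bandit with known arm means, the expected form of this quantity can be computed directly from the sequence of pulled arms. The helper below is an illustrative sketch (the name and the toy means are assumptions) and uses expected rather than realized rewards.

```python
def cumulative_regret(means, pulls):
    """Expected cumulative regret after each pull, given the true arm means."""
    best = max(means)
    total = 0.0
    curve = []
    for arm in pulls:
        total += best - means[arm]  # per-step gap to the best arm
        curve.append(total)
    return curve

means = [0.2, 0.8]
# A policy that keeps alternating arms pays the 0.6 gap every other step,
# so its regret grows linearly with the number of steps.
alternating = cumulative_regret(means, [0, 1] * 50)
# A policy that commits to the best arm after one mistake stops accumulating regret.
committed = cumulative_regret(means, [0] + [1] * 99)
print(round(alternating[-1], 2), round(committed[-1], 2))
```

Sub-linear regret means the curve flattens over time, as in the second policy; a uniform sampling policy behaves like the first.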

Algorithm Distillation: A method that trains a sequence model on the learning histories of a standard RL algorithm (like Q-learning), so that the model imitates that algorithm's step-by-step policy improvement in-context

Decision Transformer: An architecture that models RL as a sequence modeling problem, predicting actions based on past states and desired returns

In-Context Learning: The ability of a model to adapt its behavior to a new task given a few examples (context) at inference time without parameter updates