Yes, Q-learning Helps Offline In-Context RL

📝 Paper Summary

Offline Reinforcement Learning, In-Context Learning

Replacing the supervised learning objective in Algorithm Distillation with explicit offline RL objectives (like CQL and IQL) significantly improves in-context learning performance, especially on suboptimal datasets.

Core Problem

Existing offline In-Context RL methods such as Algorithm Distillation (AD) rely on supervised learning, which imitates the behavior in the data rather than maximizing reward, so they fail when datasets are suboptimal or unstructured.

Why it matters:

Offline datasets often contain suboptimal or unstructured trajectories, making pure imitation (supervised learning) ineffective for deriving optimal policies
Real-world applications (robotics, healthcare) require offline pre-training for safety but need agents that can improve over the data, not just copy it
Current methods struggle without 'learning histories' (sequences of improving policies), which are rarely available in practice

Concrete Example: When trained on 'early' (low-quality) datasets, collected before the agent has learned to solve the task, standard Algorithm Distillation (AD) achieves a normalized score below 0.4 because it clones the poor behavior. In contrast, the proposed RL-based method (IC-DQN) extracts better policies from the same data and achieves higher scores.

Key Novelty

Offline In-Context RL with Explicit Value Optimization

Replace the next-token prediction head of a Transformer with value function heads (Q-values) to explicitly maximize expected return rather than just predicting the next action
Incorporate offline RL regularizations (like conservatism in CQL) directly into the in-context learning framework to handle out-of-distribution actions and suboptimal data

Architecture

The proposed architecture for RL-based Offline ICRL.

Evaluation Highlights

IC-CQL (Conservative Q-Learning) improves on Algorithm Distillation (AD) by 28.8% on average on test targets in discrete environments
IC-IQL (Implicit Q-Learning) roughly doubles AD's performance (0.22 → 0.46 NAUC) on the challenging XLand-MiniGrid environment
Strong robustness to random data ordering: RL methods outperform AD when 'learning histories' (sequential improvements) are shuffled or unavailable

Breakthrough Assessment

7/10

Strong empirical evidence across 150+ datasets that explicit RL objectives are superior to supervised learning for offline ICRL. While the architecture is standard, the finding challenges the dominance of decision-transformer-style supervision in this niche.

⚙️ Technical Details

Problem Definition

Setting: Offline In-Context Reinforcement Learning (Offline ICRL)

Inputs: A sequence of trajectories (learning history) from a source environment: (o, a, r, done) tuples

Outputs: An optimal action a_t for the current observation o_t, adapted to the specific task/environment in context

Pipeline Flow

Input Sequence Construction (Context)
Transformer Backbone Processing
Value/Policy Head Prediction
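
The first pipeline step can be illustrated with a minimal sketch. It assumes integer observations and actions, and the names `Transition` and `build_context` are hypothetical, not taken from the paper. It only shows how a learning history of (o, a, r, done) tuples might be flattened into one context window that spans several episodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Transition:
    obs: int       # observation o_t (an integer grid cell, for illustration)
    action: int    # action a_t
    reward: float  # reward r_t
    done: bool     # episode-termination flag


def build_context(history: List[List[Transition]], context_length: int) -> List[Transition]:
    """Flatten a learning history (episodes ordered oldest first) and keep the
    most recent `context_length` transitions, so the window may cross
    episode boundaries."""
    if context_length <= 0:
        return []
    flat = [t for episode in history for t in episode]
    return flat[-context_length:]


# Two short episodes: an early one without reward, a later one that reaches the goal.
history = [
    [Transition(0, 1, 0.0, False), Transition(1, 1, 0.0, True)],
    [Transition(0, 2, 0.0, False), Transition(3, 2, 1.0, True)],
]
context = build_context(history, context_length=3)
```

Here the three-transition window starts at the end of the first episode and covers the whole second episode, which is the kind of cross-episode context the summary describes.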

System Modules

Input Embedding

Encodes transitions into tokens

Model or implementation: Linear Projection

Transformer Backbone

Processes the history of interactions to infer the task and current policy state

Model or implementation: Causal Transformer (GPT-2 style)

RL Heads

Predicts values (Q-values) or actions based on the chosen RL algorithm

Model or implementation: MLP Heads (Q-head, V-head, Policy-head)
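
As a rough illustration of what a Q-head does at inference time, the sketch below maps a backbone output vector to one Q-value per action and picks the greedy action. It simplifies the MLP head to a single linear layer, and the weights, the hidden vector, and the function names are made up for the example.

```python
from typing import List


def linear_head(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One output per action: q[a] = weights[a] . hidden + bias[a]."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]


def greedy_action(q_values: List[float]) -> int:
    """Index of the action with the highest predicted Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


hidden = [1.0, -2.0]                              # hypothetical backbone output for the current token
weights = [[0.5, 0.0], [0.0, -1.0], [1.0, 1.0]]   # 3 actions, 2 hidden units (illustrative values)
bias = [0.0, 0.5, 0.0]

q_values = linear_head(hidden, weights, bias)
action = greedy_action(q_values)
```

A supervised next-action head would instead output action probabilities fitted to the dataset's actions; the Q-head's outputs are value estimates, so the chosen action need not match what the data-collecting policy did.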

Novel Architectural Elements

Replacement of the supervised next-action prediction head with dedicated RL heads (Q-function, Value-function) on top of the AD transformer backbone
Integration of offline RL loss functions (CQL, IQL) into the sequence modeling framework

Modeling

Base Model: Causal Transformer (similar to AD backbone)

Training Method: Offline Reinforcement Learning (IC-DQN, IC-CQL, IC-IQL, IC-TD3+BC)

Objective Functions:

Purpose: Minimize temporal difference error for Q-values (DQN).

Formally: $L_{TD} = \mathbb{E}\left[\left(Q(s,a) - \left(r + \gamma \max_{a'} Q(s',a')\right)\right)^2\right]$

Purpose: Penalize Q-values for OOD actions (CQL).

Formally: $L_{CQL} = L_{TD} + \alpha \left(\log \sum_{a} \exp Q(s,a) - Q(s,a_{data})\right)$

Purpose: Implicitly maximize value using expectile regression (IQL).

Formally: $L_{IQL} = L_V + L_Q + L_{Actor}$, where $L_{Actor}$ is a weighted behavior-cloning (BC) term
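
The DQN and CQL objectives above can be sketched for a single transition with discrete actions. This is an illustration under stated assumptions, not the paper's implementation: the `done` mask on the bootstrap term and the function names are additions, and the target is treated as a fixed number (no target network is modeled).

```python
import math
from typing import List


def td_target(reward: float, next_q: List[float], done: bool, gamma: float) -> float:
    """r + gamma * max_a' Q(s', a'); assumption: no bootstrapping after a terminal step."""
    return reward if done else reward + gamma * max(next_q)


def td_loss(q: List[float], action: int, target: float) -> float:
    """Squared TD error for the action stored in the dataset."""
    return (q[action] - target) ** 2


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q: List[float], action: int) -> float:
    """log sum_a exp Q(s, a) - Q(s, a_data): pushes down Q-values of actions not in the data."""
    return logsumexp(q) - q[action]


def cql_loss(q: List[float], action: int, target: float, alpha: float) -> float:
    """L_CQL = L_TD + alpha * penalty, for one transition."""
    return td_loss(q, action, target) + alpha * cql_penalty(q, action)


q = [1.0, 2.0]        # Q(s, .) for two actions (illustrative values)
next_q = [0.0, 3.0]   # Q(s', .)
target = td_target(reward=1.0, next_q=next_q, done=False, gamma=0.9)
loss_dqn = td_loss(q, action=0, target=target)
loss_cql = cql_loss(q, action=0, target=target, alpha=1.0)
```

The penalty is always non-negative because the log-sum-exp is at least as large as any single Q-value, so the CQL loss is never smaller than the plain TD loss for `alpha >= 0`.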

Training Data:

150+ datasets derived from GridWorld (Dark Room, Key-to-Door) and MuJoCo
Categorized by 'expertise' (early, mid, late) and coverage (number of targets, histories per target)

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
context_length: Spans multiple episodes (implied)
discount_factor_gamma: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. AD: Optimizes an RL objective (reward maximization) instead of a supervised objective (behavior cloning), and adds value heads.
vs. DPT: Does not require optimal action oracle; learns from suboptimal data via Q-learning.
vs. Decision Transformer: Maintains the 'learning history' context of AD but changes the loss to Q-learning.
vs. Online ICRL (AMAGO): Adapted specifically for the offline setting where no interaction is possible during training.

Limitations

Relies on large model capacities and diverse training data for in-context capabilities to emerge
Does not fully resolve challenges with Out-of-Distribution (OOD) test environments or dynamics
Hyperparameter tuning for RL approaches was limited compared to the AD baseline
Does not introduce specific mechanisms for adaptation to novel environments beyond standard context

Reproducibility

No code URL provided in the paper. Datasets are described in detail (GridWorld, MuJoCo variants). Hyperparameters for RL baselines are mentioned as 'limited tuning compared to AD'.

📊 Experiments & Results

Evaluation Setup

Offline training on fixed datasets, followed by online evaluation on unseen tasks/targets.
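
To make the evaluation protocol concrete, here is a toy sketch of in-context evaluation on an unseen task: the context starts empty, grows with the agent's own transitions, and no weights are updated. The one-step task with a hidden goal action and the hand-written `toy_policy` are stand-ins invented for this example; in the paper's setting a trained Transformer would play the role of the policy.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, float, bool]  # (obs, action, reward, done)


def toy_policy(context: List[Step], obs: int, n_actions: int) -> int:
    """Stand-in for the trained model: reuse a rewarded action if the context
    contains one, otherwise try the lowest-numbered untried action."""
    for _, action, reward, _ in context:
        if reward > 0:
            return action
    tried = {a for _, a, _, _ in context}
    untried = [a for a in range(n_actions) if a not in tried]
    return untried[0] if untried else 0


def evaluate_in_context(
    policy: Callable[[List[Step], int, int], int],
    goal: int,
    n_actions: int,
    n_episodes: int,
) -> List[float]:
    """Run one-step episodes on a task whose goal action is hidden from the policy."""
    context: List[Step] = []  # empty at the start of evaluation
    returns: List[float] = []
    for _ in range(n_episodes):
        obs = 0
        action = policy(context, obs, n_actions)
        reward = 1.0 if action == goal else 0.0
        context.append((obs, action, reward, True))  # the context is the only memory
        returns.append(reward)
    return returns


returns = evaluate_in_context(toy_policy, goal=2, n_actions=4, n_episodes=5)
```

The per-episode returns rise as the context fills in, which is the improvement curve that the metrics listed under Metrics summarize.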

Benchmarks:

Dark Room (DR): 2D GridWorld navigation (discrete)
Dark Key-to-Door (K2D): POMDP GridWorld (discrete)
MuJoCo (HalfCheetah, Ant, Hopper, Walker): continuous control
XLand-MiniGrid: Meta-RL GridWorld

Metrics:

Normalized Area Under the Curve (NAUC)
Return after N episodes (25, 50, 100)
Interquartile Mean (IQM) of NAUC
Statistical methodology: Performance profiles (rliable), Interquartile Mean (IQM), Confidence Intervals (95%)
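
The two aggregate metrics can be sketched as follows. The exact NAUC normalization is not given in this summary, so the version below (mean per-episode return rescaled by assumed minimum and maximum returns) is one plausible reading, not the paper's definition. The IQM follows the usual meaning: the mean of the middle 50% of scores, cutting `int(0.25 * n)` values from each end.

```python
from typing import List


def nauc(returns: List[float], min_return: float, max_return: float) -> float:
    """Area under the per-episode return curve, divided by the area under a
    constant max-return curve (assumed normalization)."""
    span = max_return - min_return
    if span <= 0:
        raise ValueError("max_return must exceed min_return")
    return sum((r - min_return) / span for r in returns) / len(returns)


def iqm(scores: List[float]) -> float:
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    ordered = sorted(scores)
    k = int(0.25 * len(ordered))
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


curve_score = nauc([0.0, 0.0, 1.0, 1.0, 1.0], min_return=0.0, max_return=1.0)
robust_mean = iqm([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 10.0])
```

The outlier score of 10.0 in the second call is trimmed away, which is why IQM is preferred over a plain mean when aggregating across many datasets.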

Key Results

Benchmark	Metric	Baseline	This Paper	Δ	Notes
Discrete Environments (Avg Test Targets)	Improvement over AD (%)	0.0	28.8	+28.8	Overall comparison on discrete environments, aggregated across all datasets; RL methods outperform AD.
XLand-MiniGrid	NAUC	0.22	0.46	+0.24	Challenging XLand-MiniGrid 'tiny' dataset; large gains.
Early Datasets (Discrete)	NAUC	0.4	0.8	+0.4	Impact of dataset expertise (quality); RL methods excel on low-quality data.
Continuous Environments (Overall)	Average Test NAUC	0.6	0.8	+0.2	Continuous environment performance (MuJoCo).
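
The absolute NAUC deltas in the table can also be read as relative gains. The short calculation below uses only the baseline and result values reported above; the helper name `relative_gain` is illustrative. It gives roughly +109% for XLand-MiniGrid (consistent with the "doubled" claim), +100% for early discrete datasets, and about +33% for continuous environments.

```python
def relative_gain(baseline: float, score: float) -> float:
    """Relative improvement of `score` over `baseline`, as a fraction."""
    return (score - baseline) / baseline


# (baseline NAUC, this paper's NAUC) as reported in the results table
rows = {
    "XLand-MiniGrid": (0.22, 0.46),
    "Early Datasets (Discrete)": (0.4, 0.8),
    "Continuous Environments (Overall)": (0.6, 0.8),
}
gains = {name: relative_gain(base, score) for name, (base, score) in rows.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.0%}")
```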

Experiment Figures

Bar charts comparing average test NAUC scores across all datasets for discrete (left) and continuous (right) environments.

Performance profiles (rliable) for Train vs. Test targets.

NAUC scores when data ordering is destroyed (Random or Sorted Sample instead of Learning History).
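
The three data orderings compared in the last figure can be mimicked with a small sketch. Episodes are represented as lists of rewards, and the reading of 'Sorted Sample' as sorting by ascending episode return is an assumption, as are the function names.

```python
import random
from typing import List


def episode_return(episode: List[float]) -> float:
    """Total reward collected in one episode."""
    return sum(episode)


def order_episodes(episodes: List[List[float]], mode: str, seed: int = 0) -> List[List[float]]:
    """Arrange episodes for the context window.

    "history": keep the order in which the learning agent produced them.
    "random":  shuffle, destroying the improvement structure.
    "sorted":  ascending return (assumed meaning of 'Sorted Sample').
    """
    if mode == "history":
        return list(episodes)
    if mode == "random":
        shuffled = list(episodes)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if mode == "sorted":
        return sorted(episodes, key=episode_return)
    raise ValueError(f"unknown mode: {mode}")


episodes = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # a noisy learning history
as_history = order_episodes(episodes, "history")
as_random = order_episodes(episodes, "random", seed=1)
as_sorted = order_episodes(episodes, "sorted")
```

A supervised model trained on the shuffled variant sees no consistent improvement to imitate, whereas a value-based objective still receives the same reward signal from each transition.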

Main Takeaways

Explicit RL objectives consistently outperform supervised AD, particularly on unseen test targets (+28.8% for CQL)
RL methods are far more robust to data quality, excelling on 'early' (suboptimal) datasets where AD scores below 0.4 because behavior cloning reproduces the suboptimal behavior
Offline RL approaches (CQL, IQL) generally outperform DQN, which was designed for online RL and has no offline regularization, in this setting, highlighting the need for conservatism
RL methods handle unstructured data (randomly ordered trajectories) much better than AD, which relies on the sequential structure of learning histories
In continuous domains, offline RL (TD3+BC, IQL) outperforms both AD and unregularized TD3, which suggests that offline regularization is necessary

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Q-learning)
Offline RL challenges (OOD actions, distributional shift)
Transformer architectures (Attention, Context windows)
Algorithm Distillation (AD)

Key Terms

Algorithm Distillation (AD): An offline ICRL method that trains a Transformer to predict actions given a sequence of learning updates, effectively 'distilling' a learning algorithm into the model weights

ICRL: In-Context Reinforcement Learning—learning to solve a new RL task purely from context (prompt) without weight updates

NAUC: Normalized Area Under the Curve—a metric measuring the cumulative performance of an agent over a fixed number of evaluation episodes

Conservative Q-Learning (CQL): An offline RL algorithm that learns a lower-bound on the value function to prevent overestimation of unseen actions

Implicit Q-Learning (IQL): An offline RL algorithm that avoids querying values of unseen actions by treating the value function as a random variable and using expectile regression

Learning History: A sequence of trajectories collected by an agent while it is learning a task, showing progressive improvement

OOD: Out-of-Distribution—data or scenarios not encountered during training
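
The expectile regression mentioned in the Implicit Q-Learning entry can be illustrated with a small sketch. It uses the standard asymmetric squared loss, weight τ for positive differences and 1 − τ for negative ones, and fits a scalar value estimate to a few Q-value samples by a coarse grid search. The sample values, the grid, and the function names are illustrative assumptions.

```python
from typing import List


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, with diff = Q(s, a) - V(s)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2


def fit_value(q_samples: List[float], tau: float) -> float:
    """Grid-search the V that minimizes the summed expectile loss over the samples."""
    candidates = [i / 100 for i in range(201)]  # 0.00, 0.01, ..., 2.00
    return min(candidates, key=lambda v: sum(expectile_loss(q - v, tau) for q in q_samples))


q_samples = [0.0, 1.0, 2.0]                # Q-values of dataset actions in one state
v_mean = fit_value(q_samples, tau=0.5)     # symmetric loss: recovers the mean
v_upper = fit_value(q_samples, tau=0.9)    # leans toward the best in-dataset action
```

With τ = 0.5 the fit is the ordinary mean; with τ closer to 1 it moves toward the largest Q-value among actions that appear in the data, which is how IQL approximates a maximum without evaluating unseen actions.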