Q-value Regularized Transformer for Offline Reinforcement Learning

📝 Paper Summary

Offline Reinforcement Learning, Conditional Sequence Modeling

QT combines a Transformer-based policy for trajectory modeling with a Q-value regularization term to effectively stitch optimal sub-trajectories from sub-optimal data while maintaining stability.

Core Problem

Conditional Sequence Modeling (CSM) methods like Decision Transformer struggle to stitch together optimal trajectories from sub-optimal data because they treat trajectories as whole units rather than leveraging optimal state-level transitions.

Why it matters:

Standard CSM methods rely on high-return trajectories in the dataset; if the data is sub-optimal, the model cannot infer a policy better than the best demonstration.
Traditional Dynamic Programming (DP) methods handle stitching well but are unstable in long-horizon or sparse-reward settings, leading to poor convergence.
Existing hybrid methods (like QDT) often just use Q-values for data augmentation (relabeling returns), which fails to address unmatched return-to-go values during inference.

Concrete Example: In a maze navigation task (Maze2D), a CSM model might see two sub-optimal paths: A->B (reward 5) and B->C (reward 5). It cannot combine them to form A->B->C (reward 10) if that full trajectory doesn't exist in the data. QT uses Q-learning to value the transition at B, enabling the Transformer to stitch these segments together.
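The stitching idea in this example can be sketched with a tiny tabular Q backup. This is an illustrative toy, not the paper's implementation: the transition tuples, the action names `to_B` and `to_C`, and the helper `fitted_q` are all hypothetical, and the discount is set to 1.0 so the numbers match the rewards in the example.

```python
# Toy offline dataset: (state, action, reward, next_state, done).
# One trajectory covers A->B, another covers B->C; A->B->C never appears whole.
transitions = [
    ("A", "to_B", 5.0, "B", False),
    ("B", "to_C", 5.0, "C", True),
]

def fitted_q(transitions, gamma=1.0, sweeps=10):
    """Repeated Bellman backups restricted to state-action pairs in the dataset."""
    q = {(s, a): 0.0 for s, a, _, _, _ in transitions}
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Values of dataset actions available in the next state.
            next_values = [v for (s2, _), v in q.items() if s2 == s_next]
            bootstrap = 0.0 if done or not next_values else max(next_values)
            q[(s, a)] = r + gamma * bootstrap
    return q

q = fitted_q(transitions)
print(q[("A", "to_B")])  # 10.0
```

Each individual trajectory has a return of 5, so a purely return-conditioned model never sees a return of 10. The backup at B propagates the value of the B->C segment into the A->B transition, which is the signal a Q-value term can pass on to the policy.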

Key Novelty

Q-value Regularized Transformer (QT)

Integrates a conservative Q-learning objective directly into the Transformer's loss function, rather than just using it for data filtering or relabeling.
Uses the Transformer's standard prediction loss as a form of implicit policy regularization (keeping the policy close to behavior distribution) while the Q-value term pushes for policy improvement.
During inference, samples multiple candidate return-to-go tokens and uses the learned Q-function to select the action with the highest predicted value.

Architecture

The training and inference process of QT.

Evaluation Highlights

Scores 53.3 on the AntMaze-Large-Diverse task, a challenging sparse-reward environment, where Decision Transformer (DT) scores 0.0 (+53.3 points).
Achieves a state-of-the-art score of 129.6 on the Pen-Human Adroit task, significantly outperforming CQL (37.5) and IQL (71.5).
Consistently surpasses pure Q-learning methods (CQL, IQL) and sequence modeling methods (DT) across Maze2D stitching tasks, achieving an average score of 172.5 vs 62.0 (CQL) and 13.8 (DT).

Breakthrough Assessment

8/10

Significantly improves upon Decision Transformer by addressing its primary weakness (stitching) without losing its benefits in sparse-reward settings. The results on AntMaze and Adroit are strong relative to well-established baselines such as CQL and IQL.

⚙️ Technical Details

Problem Definition

Setting: Offline Reinforcement Learning in a Markov Decision Process (MDP)

Inputs: Static dataset of trajectory transitions (state, action, reward, next state)

Outputs: Policy mapping states to actions that maximizes cumulative discounted return

Pipeline Flow

Input Processing: History of states and RTG tokens → Transformer
Q-Network: State-Action pairs → Q-value estimation
Training: Minimize combined loss (Sequence Prediction + Q-value maximization)
Inference: Sample multiple RTG candidates → Transformer Action Proposals → Q-value Selection → Best Action
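The inference step of this pipeline can be sketched as follows. It is a simplified illustration under stated assumptions: `select_action`, `toy_policy`, and `toy_q` are hypothetical stand-ins, the policy is conditioned on a single state rather than a full history, and actions are scalars.

```python
def select_action(state, rtg_candidates, policy, q_fn):
    """Propose one action per RTG candidate, then keep the one with the highest Q-value."""
    proposals = [policy(state, rtg) for rtg in rtg_candidates]
    return max(proposals, key=lambda action: q_fn(state, action))

# Hypothetical stand-ins for the trained Transformer policy and Q-network.
def toy_policy(state, rtg):
    # A larger target return yields a larger action.
    return 0.1 * rtg

def toy_q(state, action):
    # The critic prefers actions near 0.5.
    return -(action - 0.5) ** 2

best = select_action(0.0, [1.0, 3.0, 5.0, 7.0, 9.0], toy_policy, toy_q)
print(best)
```

With these toy functions the proposal generated from the RTG candidate 5.0 is selected, because its Q-value is the highest. The point of the design is that the RTG token only has to produce a good candidate; the Q-function, not the RTG value itself, decides which action is executed.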

System Modules

Transformer Policy

Generates action distributions based on state history and target returns (acts as behavior cloning/regularization)

Model or implementation: GPT-2 architecture (causal transformer)

Q-Network

Estimates the long-term value of taking a specific action in a specific state; used to guide the policy towards higher returns

Model or implementation: MLP (Multi-Layer Perceptron)

Q-Selection Mechanism

Selects the best action from multiple proposals generated by conditioning the Transformer on different RTG values

Model or implementation: Argmax over learned Q-values

Novel Architectural Elements

Hybrid Loss Function: Combines auto-regressive sequence prediction loss (CSM) with a Q-value maximization term (DP) within the Transformer's optimization loop.
Q-Regularized Inference: Generates multiple action candidates by prompting the Transformer with different Return-to-Go tokens, then uses the learned Q-network to pick the best one.
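A minimal sketch of the hybrid loss, assuming the two terms are combined as L_DT + eta * L_Q with scalar actions. The function name `hybrid_policy_loss` and the exact placement of eta are assumptions for illustration; the source only states that eta balances the two terms.

```python
def hybrid_policy_loss(pred_actions, data_actions, q_values, eta):
    """Behavior-cloning MSE plus an eta-weighted negative mean Q-value.

    pred_actions: actions predicted by the policy
    data_actions: actions recorded in the dataset
    q_values:     critic estimates Q(s, pi(s)) for the predicted actions
    """
    n = len(pred_actions)
    l_dt = sum((p - a) ** 2 for p, a in zip(pred_actions, data_actions)) / n
    l_q = -sum(q_values) / n
    return l_dt + eta * l_q

loss = hybrid_policy_loss(
    pred_actions=[0.5, 1.0],
    data_actions=[0.0, 1.0],
    q_values=[2.0, 4.0],
    eta=0.1,
)
print(loss)  # approximately -0.175 (0.125 from MSE, -0.3 from the Q term)
```

Setting eta to 0 recovers plain behavior cloning, while a large eta lets the Q term dominate and pulls the policy away from the data distribution.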

Modeling

Base Model: GPT-2 (Transformer)

Training Method: Offline RL with hybrid objective (Supervised Learning + Q-Learning)

Objective Functions:

Purpose: Clone the behavior in the dataset (Policy Regularization).

Formally: L_DT = Mean Squared Error between the predicted action and the dataset action in the trajectory.

Purpose: Maximize the expected return of the policy (Policy Improvement).

Formally: L_Q = -Q(s, π(s)), the negative Q-value of the action the policy outputs at state s.

Purpose: Learn the Q-function.

Formally: Minimize the Bellman error using n-step returns and Double Q-learning.
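The three objectives can be written in one notation. This is a sketch under assumptions, not the paper's exact formulation: θ and φ denote policy and critic parameters, primes denote target networks, π_θ(s) abbreviates the Transformer's action output given its full history and RTG context, the weighting L_DT + η L_Q is assumed, and Double Q-learning is taken to mean the minimum over two target critics.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{DT}}(\theta) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\lVert \pi_\theta(s) - a \rVert^2\big] \\
\mathcal{L}_{Q}(\theta) &= -\,\mathbb{E}_{s\sim\mathcal{D}}\big[Q_{\phi}(s, \pi_\theta(s))\big] \\
\mathcal{L}_{\pi}(\theta) &= \mathcal{L}_{\mathrm{DT}}(\theta) + \eta\,\mathcal{L}_{Q}(\theta) \\
y_t &= \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} \min_{i\in\{1,2\}} Q_{\phi_i'}\big(s_{t+n}, \pi_{\theta'}(s_{t+n})\big) \\
\mathcal{L}_{\mathrm{critic}}(\phi_i) &= \mathbb{E}\big[\big(Q_{\phi_i}(s_t, a_t) - y_t\big)^2\big]
\end{aligned}
```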

Key Hyperparameters:

eta: Hyperparameter balancing DT loss and Q-loss (controls regularization strength)
n-step: Looking ahead n steps for Q-value estimation
learning_rate: Not explicitly reported in the paper summary
batch_size: Not explicitly reported in the paper summary
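To make the n-step hyperparameter concrete, here is a small sketch of an n-step target with a double-Q bootstrap. It assumes the clipped variant (minimum of two target estimates); the function name `n_step_target` and its arguments are hypothetical.

```python
def n_step_target(rewards, q1_next, q2_next, gamma, done):
    """n-step return plus a discounted bootstrap from the smaller of two target Q estimates.

    rewards:          the n rewards observed from step t to t+n-1
    q1_next, q2_next: two target-critic estimates at state s_{t+n}
    done:             True if the episode ended within the n steps
    """
    n = len(rewards)
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    bootstrap = 0.0 if done else min(q1_next, q2_next)
    return n_step_return + gamma ** n * bootstrap

target = n_step_target([1.0, 1.0, 1.0], q1_next=10.0, q2_next=8.0, gamma=0.5, done=False)
print(target)  # 2.75
```

Here the three rewards contribute 1 + 0.5 + 0.25 = 1.75 and the bootstrap contributes 0.125 * 8 = 1.0. A larger n relies more on observed rewards and less on the bootstrapped estimate.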

Compute: Not reported in the paper

Comparison to Prior Work

vs. QDT: QT integrates Q-values directly into the loss function for policy improvement, whereas QDT only uses them for data relabeling.
vs. Decision Transformer (DT): QT adds a Q-maximization term and Q-guided inference, enabling stitching capabilities that DT lacks.
vs. CQL/IQL: QT retains the Transformer's sequence modeling benefits (handling long horizons/sparse rewards) while adding the stitching capability of Q-learning.

Limitations

Inference cost is higher than standard DT because it requires multiple forward passes with different RTG tokens to select the best action.
Requires training a separate Q-network alongside the Transformer, increasing training complexity compared to pure BC or DT.
Performance depends on the 'eta' hyperparameter which balances behavior cloning and Q-maximization.

Reproducibility

Code: https://github.com/charleshsc/QT

The code is publicly available. Hyperparameters such as eta are named as key tuning knobs, but per-environment values are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Offline RL evaluation on standard benchmarks.

Benchmarks:

D4RL (Locomotion, Manipulation, and Maze Navigation)

Metrics:

Normalized Score (D4RL convention: 0 corresponds to a random policy and 100 to expert performance; scores can exceed 100, as in the Maze2D results)
Statistical methodology: Mean and standard deviation reported across seeds (typically 3 or 5; the exact number is not specified in the summarized text).
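The normalized score follows the standard D4RL formula, sketched below. The function name and the reference returns in the example are hypothetical placeholders, not values from any particular environment.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw episode return so that random = 0 and expert = 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns: random policy = 50, expert policy = 250.
print(d4rl_normalized_score(150.0, 50.0, 250.0))  # 50.0
print(d4rl_normalized_score(300.0, 50.0, 250.0))  # 125.0
```

The second call shows why scores above 100 are possible: a policy that collects more return than the reference expert is normalized to a value greater than 100.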

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Gym Locomotion (Average)	Normalized Score	77.0	89.8	+12.8
Adroit (Average)	Normalized Score	15.6	60.4	+44.8
Kitchen (Average)	Normalized Score	58.6	78.4	+19.8
Maze2D (Average)	Normalized Score	13.8	172.5	+158.7
AntMaze (Average)	Normalized Score	66.8	76.5	+9.7

Main Takeaways

QT consistently outperforms both pure Q-learning (CQL, IQL) and pure Sequence Modeling (DT) across all domains, showing it effectively combines their strengths.
The stitching capability is drastically improved over DT, evidenced by the Maze2D results (from ~14 to ~172).
Sparse reward performance (AntMaze) is superior to IQL, suggesting the Transformer's trajectory modeling helps where 1-step DP struggles.
The method is robust across diverse tasks: locomotion (Gym), manipulation (Kitchen/Adroit), and navigation (Maze).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Bellman equation)
Offline RL (stitching, out-of-distribution issues)
Transformers (attention mechanisms, sequence modeling)
Decision Transformer architecture

Key Terms

Stitching: The ability to combine parts of different sub-optimal trajectories to form a new, optimal trajectory that was never explicitly seen in the dataset.

Decision Transformer (DT): An offline RL method that treats policy learning as a sequence modeling problem, predicting actions based on past states and desired future returns (Return-to-Go).

Return-to-Go (RTG): The sum of future rewards from a specific timestep to the end of the episode.

Bellman Equation: A recursive equation used in Dynamic Programming to calculate the value (Q-value) of a state-action pair based on the immediate reward and the value of the next state.

Conservative Q-Learning (CQL): An algorithm that learns a lower-bound (conservative) estimate of the value function to prevent overestimation of unseen actions in offline RL.

Behavior Cloning (BC): Supervised learning where the policy is trained to mimic the actions in the dataset exactly.

n-step Bellman: A variation of the Bellman update that sums n steps of observed rewards before bootstrapping from the value estimate, which typically reduces bootstrapping bias at the cost of higher variance.
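As a small illustration of the Return-to-Go definition above, the undiscounted RTG at every timestep can be computed with one backward pass over an episode's rewards. The helper name `returns_to_go` is hypothetical.

```python
def returns_to_go(rewards):
    """Return the list whose entry t is the sum of rewards from step t to the end."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg

print(returns_to_go([1.0, 0.0, 2.0, 5.0]))  # [8.0, 7.0, 7.0, 5.0]
```

During training these values come from the dataset; at inference the first RTG token has to be chosen by the user, which is where a mismatch between the requested and the achievable return can arise.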