Offline Actor-Critic Reinforcement Learning Scales to Large Models

📝 Paper Summary

Offline Reinforcement Learning · Scalable Transformer Architectures for Control

Perceiver-Actor-Critic (PAC) demonstrates that offline actor-critic algorithms scale to billion-parameter models following supervised-style scaling laws, outperforming behavioral cloning on suboptimal multi-task robotics data.

Core Problem

Supervised Behavioral Cloning (BC) fails when expert data is scarce or suboptimal, while existing offline RL methods have not been successfully scaled to large transformer models due to instability and computational costs.

Why it matters:

Robotics data is often expensive to collect and suboptimal, making reliance on pure expert demonstrations (required for BC) a major bottleneck
Prior work scaling transformers for control (like Gato) relies on BC, missing the ability of RL to learn from mixed-quality data or self-improve
It was previously unknown if RL objectives follow the same power-law scaling relations as supervised learning

Concrete Example: In the CHEF simulation task involving object stacking with suboptimal data (28% success rate), a standard BC approach only achieves 17.0% success, failing to learn a competent policy. The proposed method recovers a 55.0% success rate from the same suboptimal dataset.

Key Novelty

Perceiver-Actor-Critic (PAC)

Adapts the Perceiver-IO architecture for RL by using latent cross-attention to handle massive multimodal inputs (vision, text, proprioception) efficiently without quadratic scaling costs
Injects actions into the Q-function via cross-attention (late fusion) rather than as inputs, enabling efficient evaluation of multiple action candidates against a cached state representation

Architecture

The Perceiver-Actor-Critic (PAC) architecture, detailing the flow from multimodal inputs to policy and value outputs

Evaluation Highlights

Outperforms Gato baseline on 32 Control Suite tasks (92.1% vs 63.6% expert score) using the same dataset
Achieves more than 3x the success rate of BC (55.0% vs 17.0%) on the CHEF task using severely suboptimal training data
Scales to 988M parameters, demonstrating for the first time that offline RL follows power-law scaling laws similar to LLMs, often scaling more efficiently than BC

Breakthrough Assessment

9/10

Establishes the first clear evidence of scaling laws for offline RL, provides a recipe for training RL agents at roughly 1B parameters (988M in the largest model), and demonstrates mastery of real-world robotics tasks via self-improvement.

⚙️ Technical Details

Problem Definition

Setting: Multi-task offline reinforcement learning in a Markov Decision Process (MDP)

Inputs: Multimodal state observations s_t (proprioception, vision) and task descriptions τ (language or goal images)

Outputs: Action a_t (continuous control) and estimated Q-values

Pipeline Flow

Input Processing (Encoders)
Latent Transformation (Perceiver Backbone)
Output Decoding (Policy & Critic Heads)

System Modules

Modality Encoders

Embed raw multimodal inputs into a unified sequence

Model or implementation: Various (ResNet for vision, Tokenizer for text, MLP for proprioception)

Latent Transformer

Process information in a compressed latent space to avoid quadratic complexity of full attention

Model or implementation: Perceiver-IO backbone (Cross-Attention -> Stacked Self-Attention)
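
The cost argument behind the latent transformer can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the sizes (`n_inputs`, `n_latents`, `dim`) are hypothetical, the learned query/key/value projections are omitted, and the "cost" is only a count of query-key pairs. It shows a small latent array cross-attending to a long input sequence, followed by self-attention among the latents only.

```python
import math
import random

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention: one output row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n_inputs, n_latents, dim = 200, 8, 4    # hypothetical sizes, not from the paper
inputs = rand_matrix(n_inputs, dim)     # stand-in for encoded multimodal tokens
latents = rand_matrix(n_latents, dim)   # stand-in for learned latent queries

# Encode: latents query the long input sequence (n_latents * n_inputs pairs).
z = cross_attend(latents, inputs, inputs)
# Process: self-attention among the latents only (n_latents ** 2 pairs per layer).
z = cross_attend(z, z, z)

cross_cost = n_latents * n_inputs            # pairs scored by the encoder step
full_self_attention_cost = n_inputs ** 2     # pairs full self-attention would score
```

With these illustrative sizes the encoder step scores 1,600 query-key pairs, whereas full self-attention over the inputs would score 40,000; the gap widens as the input sequence grows.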

Policy Decoder (Output Decoding)

Predict actions given the latent state

Model or implementation: Cross-Attention Head

Q-Value Decoder (Output Decoding)

Estimate state-action values (Critic)

Model or implementation: Cross-Attention Head

Novel Architectural Elements

Action-Conditioned Q-Decoder: Actions are injected into the critic via the query vector of a cross-attention block rather than as input tokens, decoupling state encoding from action evaluation
Multi-scale normalizer for proprioception inputs to handle arbitrary input scales across diverse tasks
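
The action-conditioned Q-decoder idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `cached_latents`, `action_embed`, and `readout` are hypothetical stand-ins for learned quantities, the critic returns a scalar here although the paper's critic is distributional, and all sizes are made up. The point is that the state is encoded once and each candidate action only forms a new query.

```python
import math
import random

def attend(query, keys, values):
    """Single-query scaled dot-product attention over key/value rows."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

rng = random.Random(0)
dim, n_latents, n_candidates, action_dim = 4, 6, 10, 2   # hypothetical sizes

# Computed once per state by the (expensive) encoders and latent transformer.
cached_latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_latents)]

# Hypothetical stand-ins for learned parameters.
action_embed = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(action_dim)]
readout = [rng.gauss(0, 1) for _ in range(dim)]

def q_value(action):
    # The action only forms the query; keys and values come from the cache.
    query = [sum(a * action_embed[i][j] for i, a in enumerate(action))
             for j in range(dim)]
    attended = attend(query, cached_latents, cached_latents)
    return sum(r * x for r, x in zip(readout, attended))

candidates = [[rng.uniform(-1, 1) for _ in range(action_dim)]
              for _ in range(n_candidates)]
q_values = [q_value(a) for a in candidates]   # no re-encoding of the state
best_action = candidates[max(range(n_candidates), key=q_values.__getitem__)]
```

Scoring an extra candidate costs one small cross-attention over the cached latents, which is why sampling several actions per state stays cheap in this design.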

Modeling

Base Model: Perceiver-Actor-Critic (PAC) L-size (approx. 1B parameters)

Training Method: Offline KL-Regularized Actor-Critic (simplified MPO)

Objective Functions:

Purpose: Optimize policy to maximize Q-values while staying close to a reference policy.
Formally: J(π) = E[Q_π(s,a) - η D_KL(π, π_ref)]

Purpose: Regularize policy towards the behavior distribution to prevent overestimation (BC term).
Formally: L_BC = α D_KL(b, π_θ)

Purpose: Learn a distributional Q-function.
Formally: L_Q = β D_KL(Γ_θ'(q), p_θ(q))
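
How the actor terms might combine can be sketched in a few lines. The sketch assumes the standard MPO-style result that the KL-regularized objective is maximized by a policy proportional to π_ref(a|s) exp(Q(s,a)/η), so that for actions sampled from π_ref the improvement weights are a softmax of Q/η. The (1 - alpha) weighting of the improvement term, the function names, and all numbers are assumptions for illustration. Each KL term is written as a cross-entropy, which differs from the KL only by a constant that does not depend on the policy parameters.

```python
import math

def improvement_weights(q_values, eta):
    """Softmax of Q/eta over sampled actions: weights of the improved policy."""
    top = max(q_values)
    exps = [math.exp((q - top) / eta) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def actor_loss(logp_samples, q_samples, logp_data_action, alpha, eta):
    """alpha * BC term + (1 - alpha) * policy-improvement term (assumed mix)."""
    weights = improvement_weights(q_samples, eta)
    # Cross-entropy between the reweighted samples and the current policy.
    improvement = -sum(w * lp for w, lp in zip(weights, logp_samples))
    # Negative log-likelihood of the action actually stored in the dataset.
    behavior_cloning = -logp_data_action
    return alpha * behavior_cloning + (1.0 - alpha) * improvement

# Hypothetical numbers for one state with three sampled actions.
q_samples = [1.0, 2.0, 0.5]
logp_samples = [-1.2, -0.9, -2.0]
weights = improvement_weights(q_samples, eta=1.0)
loss_bc_only = actor_loss(logp_samples, q_samples, -0.7, alpha=1.0, eta=1.0)
loss_default = actor_loss(logp_samples, q_samples, -0.7, alpha=0.75, eta=1.0)
```

Setting `alpha=1.0` reduces the actor loss to pure behavioral cloning, and lowering it shifts weight toward the Q-weighted term; a smaller `eta` concentrates the weights on the highest-valued sample.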

Training Data:

3.64M episodes total
Gato dataset (Control Suite)
RoboCat dataset (RGB Stacking, Insertion)
CHEF dataset (Stacking, sim & real)

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: 512 trajectories
updates: 3M
alpha_bc_weight: 0.75 (default PAC) or tuned per dataset
N_action_samples: 10

Compute: Inference runs at 20 Hz on a local NVIDIA RTX 3090 GPU for the 988M-parameter model

Comparison to Prior Work

vs. Gato: PAC uses offline RL instead of pure BC, enabling learning from suboptimal data and outperforming on control tasks
vs. RoboCat: PAC achieves comparable or better performance with the same model scale but introduces an efficient critic for self-improvement
vs. Q-Transformer: PAC uses a Perceiver architecture with action-querying for efficiency, whereas Q-Transformer relies on computationally expensive autoregressive discretization (the paper does not use Q-Transformer as a direct architectural baseline; the comparison is conceptual)

Limitations

Requires reward annotations for all data, which can be expensive or unavailable in some domains
No strong evidence of zero-shot transfer across tasks observed in the current experiments
Training large models is computationally intensive, requiring TPUs and significant time (3M steps)

Reproducibility

No code or model weights are provided. The paper describes the architecture and algorithm in detail, including hyperparameters for the scaling-law experiments in Appendix C. Training appears to rely on TPU infrastructure, but specific compute hours are not listed.

📊 Experiments & Results

Evaluation Setup

Multi-task continuous control in simulation (DeepMind Control Suite, MuJoCo) and real-world robotics (Sawyer arm)

Benchmarks:

Gato Control Suite: continuous control (32 tasks)
RoboCat (RGB Stacking, Insertion): robotic manipulation
CHEF (Stacking): robotic manipulation (sim & real)

Metrics:

Success Rate
Average Return (normalized to expert)
Statistical methodology: Wilson score intervals (alpha=0.05) for success rates; Standard error for 95% CIs on rewards
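
For readers unfamiliar with the Wilson score interval named above, a minimal sketch of the standard formula follows. The episode counts in the example are hypothetical and are not taken from the paper; `z` is the two-sided normal quantile for alpha=0.05.

```python
import math

def wilson_interval(successes, trials, z=1.959963984540054):
    """Two-sided Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials))
    return center - half_width, center + half_width

# Hypothetical counts: 55 successes in 100 evaluation episodes.
low, high = wilson_interval(55, 100)
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains usable when the observed success rate is close to 0% or 100%.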

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Gato Control Suite	Expert Score (%)	63.6	92.1	+28.5
RoboCat: Insertion	Success Rate (%)	71.3	89.3	+18.0
CHEF: Sim	Success Rate (%)	17.0	55.0	+38.0
CHEF: Real	Success Rate (%)	69.8	93.2	+23.4

PAC outperforms strong BC baselines on standard benchmarks, particularly on the Gato Control Suite tasks.
Experiments on suboptimal data (CHEF: Sim) demonstrate that PAC recovers performance where BC fails.
Real-world finetuning experiments (CHEF: Real) show the capability to master tasks using self-generated data.

Experiment Figures

Scaling law plots for PAC, showing the relationship between FLOPs, parameters, and tokens against average return

Main Takeaways

Offline RL scaling laws exist: PAC performance follows a power law with respect to compute/parameters, often with a steeper improvement slope than BC
RL is superior to BC on suboptimal data: While BC performance collapses on low-quality datasets, PAC extracts competent policies significantly exceeding the average performance of the data
Seamless transition: The method allows stable pre-training with high BC weight (alpha) and gradual transition to pure RL for fine-tuning, enabling safe training of large models
Self-improvement works at scale: A roughly 1B-parameter model can be fine-tuned on real-world robots using self-generated data to achieve mastery (>90% success)
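
The power-law claim in the first takeaway can be made concrete with a small fitting sketch. It assumes the usual form y = a * x^b, which becomes a straight line in log-log space. The data points are synthetic, generated from a known power law so the fit can be checked; they are not the paper's measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    var = sum((u - mean_x) ** 2 for u in log_x)
    b = cov / var                         # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)     # prefactor = exp(intercept)
    return a, b

# Synthetic data, NOT from the paper: model sizes and a loss-like metric.
params = [1e7, 1e8, 1e9]
metric = [5.0 * p ** -0.2 for p in params]
a, b = fit_power_law(params, metric)
```

Comparing the fitted exponent `b` for two training objectives is one way to express the statement that one of them improves along a steeper slope than the other.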

📚 Prerequisite Knowledge

Prerequisites

Offline Reinforcement Learning (Actor-Critic methods)
Transformer architectures (Attention mechanisms)
Perceiver / Perceiver-IO architecture
Scaling laws (Chinchilla)

Key Terms

PAC: Perceiver-Actor-Critic—the proposed neural architecture adapting Perceiver-IO for scalable actor-critic RL

BC: Behavioral Cloning—supervised learning that mimics the actions in a dataset

MPO: Maximum a Posteriori Policy Optimisation—an RL algorithm that frames policy updates as weighted supervised learning

Offline RL: Reinforcement learning using a fixed dataset without interacting with the environment during training

Perceiver-IO: A transformer architecture that maps high-dimensional inputs to a smaller latent array via cross-attention to reduce computational complexity

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution

Proprioception: Sensing the position, movement, and orientation of the robot's own body parts

Scaling laws: Empirical power-law relationships between model size, dataset size, compute budget, and performance