Robotic Manipulation Datasets for Offline Compositional Reinforcement Learning

📝 Paper Summary

Offline Reinforcement Learning (Offline RL) · Compositional Generalization · Robot Manipulation

This paper introduces four large-scale offline reinforcement learning datasets derived from compositional robotic tasks to evaluate how well agents can decompose skills and generalize to unseen task combinations.

Core Problem

Standard offline RL benchmarks are typically single-task or lack structured relatedness between tasks, making it difficult to study whether agents can learn reusable functional components.

Why it matters:

Collecting robot data is expensive; offline RL promises to reuse existing data, but current datasets don't adequately test compositional generalization (combining known skills for new tasks).
Existing benchmarks often lack a clear notion of task relatedness, preventing analysis of selective transfer and functional decomposition.

Concrete Example: An agent might have data on 'picking up a cup' and 'pushing a box'. A compositional benchmark tests if the agent can combine these skills to 'push a cup' without ever seeing that specific task combination in the training data. Current monolithic agents often fail this zero-shot transfer.

Key Novelty

Compositional Offline RL Datasets (CompoSuite-Offline)

Each dataset provides 256 million transitions: 1 million for each of 256 tasks generated by composing elements along 4 axes (robot, object, obstacle, objective), which yields a structured grid of related tasks.
Includes datasets with varying quality levels (Expert, Medium, Warmstart, Medium-Replay) to simulate realistic data availability scenarios where expert demonstrations are scarce.
Defines specific evaluation protocols (Compositional Sampling, Restricted Sampling) to rigorously test an agent's ability to extract and recombine functional modules.
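
The task grid described above can be sketched as a Cartesian product over the four axes. This is a minimal illustration, assuming four elements per axis (4 × 4 × 4 × 4 = 256); the element names below are assumptions chosen for readability and are not taken from this summary, apart from the IIWA robot that it mentions.

```python
from itertools import product

# Element names are illustrative assumptions; the summary only names the four axes.
AXES = {
    "robot": ["IIWA", "Jaco", "Gen3", "Panda"],
    "object": ["box", "dumbbell", "plate", "hollow_box"],
    "obstacle": ["none", "wall", "doorway", "goal_wall"],
    "objective": ["pick_place", "push", "shelf", "trashcan"],
}


def all_tasks(axes=AXES):
    """Return every (robot, object, obstacle, objective) combination."""
    return list(product(*axes.values()))


tasks = all_tasks()
print(len(tasks))  # 4 * 4 * 4 * 4 = 256
```

Because every task is a tuple of axis elements, two tasks are "related" exactly when they share one or more elements, which is the notion of relatedness the evaluation protocols rely on.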

Evaluation Highlights

Current offline RL methods (IQL, BC) achieve varying success on training tasks (up to 96% with expert data) but fail significantly on zero-shot compositional generalization (often <20%).
Compositional architectures (CP-IQL) outperform monolithic baselines on zero-shot tasks (e.g., +24 percentage points in success rate on the Expert-Warmstart split, from 0.20 to 0.44) but still struggle with Restricted Sampling (0.03 success).
Behavioral Cloning (BC) fails completely (0% success) when trained on 'Medium-Replay' data, while IQL maintains some performance, highlighting the difficulty of learning from noisy, multi-modal offline data.

Breakthrough Assessment

7/10

While not a new algorithm, the dataset fills a critical gap in offline RL by enabling rigorous study of compositionality. The baselines' poor zero-shot performance highlights a significant open challenge for the field.

⚙️ Technical Details

Problem Definition

Setting: Offline Compositional Reinforcement Learning on the CompoSuite benchmark

Inputs: Offline dataset D = {(s, a, s', r)} drawn from behavioral policies on a subset of tasks

Outputs: Policy π(a|s) capable of solving both training tasks and unseen zero-shot task combinations

Pipeline Flow

State Decomposition
Module Selection (Task-Dependent)
Modular Processing
Action Aggregation

System Modules

State Input

Receives a 93-dimensional state vector containing proprioceptive and object features

Model or implementation: Vector input

Compositional Modules (CP-BC / CP-IQL)

Process specific parts of the state corresponding to task elements (robot, object, etc.)

Model or implementation: Hierarchical Modular Neural Network

Action Head

Produces final motor commands

Model or implementation: MLP (Multilayer Perceptron)

Novel Architectural Elements

Application of Modular Neural Networks (MNN) to Offline RL: enforcing parameter sharing across tasks for specific functional components (robot, object, obstacle, objective) to induce compositional generalization.
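
The parameter-sharing idea can be sketched as follows. This is a toy illustration and not the authors' architecture: the class name `ModularPolicy`, the single linear-plus-ReLU layer per module, the linear action head, the processing order of the axes, and all dimensions are assumptions. What it does show is the mechanism named above: one module per element, selected by the task and chained, so any combination of elements (including an unseen one) maps to a valid forward pass.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weights and zero bias; a stand-in for a trained layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(layer, x, relu=True):
    """Apply one dense layer, optionally followed by ReLU."""
    w, b = layer
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [max(0.0, o) for o in out] if relu else out


class ModularPolicy:
    """One module per (axis, element); a task selects one module per axis."""

    def __init__(self, axes, slice_dim, hidden_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.order = list(axes)  # processing order is an assumption
        self.modules = {}
        n_prev = 0  # the first axis receives only its own state slice
        for axis in self.order:
            for element in axes[axis]:
                self.modules[(axis, element)] = make_linear(
                    slice_dim + n_prev, hidden_dim, rng
                )
            n_prev = hidden_dim
        self.head = make_linear(hidden_dim, action_dim, rng)

    def act(self, task, state_slices):
        """Chain the task's modules, then map the last hidden vector to an action."""
        h = []
        for axis in self.order:
            module = self.modules[(axis, task[axis])]
            h = linear(module, state_slices[axis] + h)
        return linear(self.head, h, relu=False)


axes = {
    "obstacle": ["none", "wall"],
    "object": ["box", "plate"],
    "objective": ["push", "shelf"],
    "robot": ["IIWA", "Panda"],
}
policy = ModularPolicy(axes, slice_dim=3, hidden_dim=8, action_dim=2)
task = {"obstacle": "wall", "object": "plate", "objective": "push", "robot": "IIWA"}
slices = {axis: [0.5, -0.2, 0.1] for axis in axes}
action = policy.act(task, slices)
print(len(policy.modules), len(action))
```

In this sketch the number of modules grows with the sum of the elements per axis (8 here), while the number of representable tasks grows with their product (16 here), which is why sharing modules across tasks is expected to help generalization.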

Modeling

Base Model: Modular Neural Network (CP-IQL/CP-BC) vs Standard MLP (IQL/BC)

Training Method: Implicit Q-Learning (IQL) and Behavioral Cloning (BC)

Objective Functions:

Purpose: Minimize difference between predicted and actual actions in dataset.
Formally: MSE loss (BC) or weighted MSE (IQL actor).

Purpose: Learn value function without querying out-of-distribution actions (IQL only).
Formally: Expectile regression for V-function and MSE for Q-function.
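
The two critic losses can be sketched in plain Python. This follows the standard IQL formulation (asymmetric squared error for V, squared TD error for Q). The function names, the expectile `tau=0.7`, and the discount `gamma=0.99` are illustrative assumptions; the values used in the paper are not given in this summary.

```python
def expectile_loss(q_values, v_values, tau=0.7):
    """Mean of |tau - 1(u < 0)| * u**2 with u = Q(s, a) - V(s).

    With tau > 0.5, underestimating Q (u > 0) is penalized more than
    overestimating it, so V approaches an upper expectile of Q over
    dataset actions without evaluating any out-of-distribution action.
    """
    total = 0.0
    for q, v in zip(q_values, v_values):
        u = q - v
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(q_values)


def td_loss(q_values, rewards, next_v_values, gamma=0.99):
    """Mean squared error between Q(s, a) and the target r + gamma * V(s')."""
    total = 0.0
    for q, r, v_next in zip(q_values, rewards, next_v_values):
        target = r + gamma * v_next
        total += (target - q) ** 2
    return total / len(q_values)


print(expectile_loss([1.0], [0.0], tau=0.7))
print(expectile_loss([0.0], [1.0], tau=0.7))
```

With `tau=0.5` the expectile loss reduces to half the ordinary mean squared error, which is a quick sanity check on any implementation.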

Training Data:

Expert: 1M transitions/task from 90% success agent
Medium: 1M transitions/task from 30% success agent
Warmstart: 1M transitions/task from early SAC training (~1% success)
Medium-Replay: 1M transitions/task sampled from replay buffer up to 30% success

Key Hyperparameters:

batch_size: # training tasks * 256
bc_steps: 50,000
iql_steps: 300,000
optimizer: Adam
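
As a worked example of the batch-size rule, the arithmetic below uses the 224 training tasks reported for the sampling protocols. Reading the rule as 256 transitions drawn per task per step is an assumption, as are the helper names. Under that reading, the step counts imply roughly how many times each task's 1M transitions are revisited.

```python
PER_TASK_BATCH = 256            # from the batch_size rule
TRANSITIONS_PER_TASK = 1_000_000  # dataset size per task


def total_batch_size(num_training_tasks, per_task=PER_TASK_BATCH):
    """Total batch size when every training task contributes per_task samples."""
    return num_training_tasks * per_task


def passes_over_task_data(steps, per_task=PER_TASK_BATCH,
                          dataset_size=TRANSITIONS_PER_TASK):
    """Approximate number of passes over one task's data (assumes per-task sampling)."""
    return steps * per_task / dataset_size


print(total_batch_size(224))           # 224 * 256 = 57344
print(passes_over_task_data(50_000))   # BC: 12.8
print(passes_over_task_data(300_000))  # IQL: 76.8
```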

Compute: Not reported in the paper

Comparison to Prior Work

vs. D4RL: CompoSuite-Offline provides structured task variations (compositional axes) to test generalization, whereas D4RL focuses on single-task performance.
vs. Meta-World: This paper focuses on the Offline RL setting with pre-collected datasets rather than online interaction.
vs. Conservative Q-Learning (CQL) [cited in paper]: Authors found CQL achieved 0% success on these specific multi-task settings, so they omitted it from detailed results.

Limitations

Simulated environment only (MuJoCo/Robosuite), no real-world robot data.
Dense rewards used for all tasks, simplifying the credit assignment problem compared to sparse reward settings.
Current offline RL baselines perform poorly on the compositional splits, suggesting the problem might be too difficult for existing methods without modification.
Data collection used PPO/SAC with compositional architectures, potentially biasing the data distribution towards modular solutions.

Reproducibility

Code: https://github.com/lifelong-ml/offline-compositional-rl-datasets

Datasets publicly available at Dryad (doi:10.5061/dryad.9cnp5hqps). Code and train-test splits available at github.com/lifelong-ml/offline-compositional-rl-datasets. Full hyperparameters provided in Appendix B.

📊 Experiments & Results

Evaluation Setup

Offline training on subsets of 256 tasks, followed by online evaluation in the CompoSuite simulator.

Benchmarks:

CompoSuite-Offline (Robotic Manipulation) [New]

Metrics:

Success Rate
Cumulative Return
Statistical methodology: Results averaged over 3 random seeds.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Uniform Sampling: Evaluating agents trained on 224 tasks and tested on 32 unseen tasks. Confirms basic learnability on Expert data but highlights difficulty on lower quality data.
CompoSuite-Offline	Success Rate (Test)	0.96	0.96	0.00
CompoSuite-Offline	Success Rate (Test)	0.19	0.45	+0.26
CompoSuite-Offline	Success Rate (Test)	0.00	0.00	0.00
Compositional Sampling: Training on Expert tasks (76) + Warmstart tasks (148), testing on 32 Zero-shot tasks. Tests ability to combine high-quality components.
CompoSuite-Offline	Success Rate (Zero-Shot)	0.20	0.44	+0.24
Restricted Sampling: Severe restriction where a specific element (e.g., IIWA robot) is seen in only one task during training. Tests strong compositional generalization.
CompoSuite-Offline	Success Rate (Zero-Shot)	0.00	0.03	+0.03
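
The three sampling protocols can be sketched as task-level splits. This is only an illustration that reproduces the task counts stated above, assuming four elements per axis and purely random assignment. The function names are hypothetical, the restricted-split held-out set is an assumption, and the splits actually used in the paper are the ones distributed in the repository.

```python
import random
from itertools import product


def build_tasks(n_per_axis=4):
    """Tasks as index tuples (robot, object, obstacle, objective)."""
    return list(product(range(n_per_axis), repeat=4))


def uniform_split(tasks, n_test=32, seed=0):
    """Random train/test split over tasks."""
    shuffled = tasks[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def compositional_split(tasks, n_expert=76, n_warmstart=148, seed=0):
    """Expert-data tasks, warmstart-data tasks, and held-out zero-shot tasks."""
    shuffled = tasks[:]
    random.Random(seed).shuffle(shuffled)
    expert = shuffled[:n_expert]
    warmstart = shuffled[n_expert:n_expert + n_warmstart]
    zero_shot = shuffled[n_expert + n_warmstart:]
    return expert, warmstart, zero_shot


def restricted_split(tasks, axis=0, element=0, seed=0):
    """Keep exactly one training task containing the restricted element."""
    with_elem = [t for t in tasks if t[axis] == element]
    without = [t for t in tasks if t[axis] != element]
    kept = random.Random(seed).choice(with_elem)
    held_out = [t for t in with_elem if t != kept]
    return without + [kept], held_out


tasks = build_tasks()
train, test = uniform_split(tasks)
expert, warmstart, zero_shot = compositional_split(tasks)
r_train, r_held_out = restricted_split(tasks)
print(len(train), len(test), len(expert), len(warmstart), len(zero_shot))
```

In the restricted split, the agent sees the chosen element in a single context, so solving any held-out task requires separating that element's behavior from the rest of the one task in which it appeared.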

Main Takeaways

Compositional architectures (CP-IQL) generally outperform monolithic ones (IQL) in zero-shot generalization, especially when data quality is mixed (Expert + Warmstart).
Behavioral Cloning works well for Expert data but collapses to 0% success on Medium-Replay data, confirming its brittleness to multimodal/noisy distributions.
Current offline RL methods are unable to extract compositional structure from restricted data (Restricted Sampling), failing to generalize when an element is seen in only one context.
The proposed datasets provide a gradient of difficulty (Expert -> Medium -> Warmstart -> Medium-Replay) that effectively distinguishes the capabilities of different algorithms.
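
One way to see why an MSE-trained BC policy can be brittle on multimodal data is a toy one-dimensional case. This is an illustration under stated assumptions and not an analysis from the paper: suppose the dataset contains two equally frequent behaviors for the same state, action -1 and action +1. The prediction that minimizes mean squared error is their average, which matches neither behavior.

```python
def best_constant_action(actions):
    """The constant prediction that minimizes mean squared error is the sample mean."""
    return sum(actions) / len(actions)


def mse(prediction, actions):
    """Mean squared error of a single predicted action against dataset actions."""
    return sum((prediction - a) ** 2 for a in actions) / len(actions)


# Hypothetical dataset: two modes for the same state, equally represented.
actions = [-1.0] * 50 + [1.0] * 50

pred = best_constant_action(actions)
print(pred)                 # 0.0, an action that appears nowhere in the data
print(mse(pred, actions))   # 1.0
print(mse(1.0, actions))    # 2.0, committing to one mode scores worse under MSE
```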

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Policies, Rewards)
Offline RL (Distribution shift, Conservative updates)
Neural Network Architectures (MLPs, Modular Networks)

Key Terms

Offline RL: Reinforcement learning where the agent learns from a fixed, previously collected dataset without interacting with the environment during training

BC: Behavioral Cloning—a supervised learning approach that trains an agent to mimic the actions in the dataset

IQL: Implicit Q-Learning—an offline RL algorithm that avoids querying out-of-distribution actions by treating the value function update as an expectile regression

PPO: Proximal Policy Optimization—an on-policy RL algorithm used here to generate the data for the datasets

SAC: Soft Actor-Critic—an off-policy RL algorithm used here to generate the 'warmstart' dataset

Compositional RL: RL approaches where tasks are decomposed into functional modules (e.g., 'pick', 'place', 'robot arm') that can be recombined to solve new tasks

Zero-shot Generalization: The ability of a model to solve a task it has never seen before during training, relying on knowledge transfer from related tasks

Warmstart: Data collected during the early stages of training (low success rate), simulating a scenario where limited online RL was performed

Medium-Replay: A dataset consisting of the replay buffer of an agent trained up to medium performance, containing a mix of poor and decent trajectories

CompoSuite: A simulated robotic manipulation benchmark consisting of 256 tasks created by composing robot arms, objects, obstacles, and objectives