
Robotic Manipulation Datasets for Offline Compositional Reinforcement Learning

Marcel Hussing, Jorge A. Mendez, Anisha Singrodia, Cassandra Kent, Eric Eaton
University of Pennsylvania, Massachusetts Institute of Technology
arXiv (2023)
RL Benchmark

📝 Paper Summary

Topics: Offline Reinforcement Learning (Offline RL) · Compositional Generalization · Robot Manipulation
This paper introduces four large-scale offline reinforcement learning datasets derived from compositional robotic tasks to evaluate how well agents can decompose skills and generalize to unseen task combinations.
Core Problem
Standard offline RL benchmarks are typically single-task or lack structured relatedness between tasks, making it difficult to study whether agents can learn reusable functional components that transfer across tasks.
Why it matters:
  • Collecting robot data is expensive; offline RL promises to reuse existing data, but current datasets don't adequately test compositional generalization (combining known skills for new tasks).
  • Existing benchmarks often lack a clear notion of task relatedness, preventing analysis of selective transfer and functional decomposition.
Concrete Example: An agent might have data on 'picking up a cup' and 'pushing a box'. A compositional benchmark tests whether the agent can recombine these skills to 'push a cup' without ever seeing that specific combination in the training data. Current monolithic agents often fail this kind of zero-shot transfer.
Key Novelty
Compositional Offline RL Datasets (CompoSuite-Offline)
  • Provides 256 million transitions across 256 tasks generated by composing 4 axes (robot, object, obstacle, objective), creating a structured grid of related tasks.
  • Includes datasets with varying quality levels (Expert, Medium, Warmstart, Medium-Replay) to simulate realistic data availability scenarios where expert demonstrations are scarce.
  • Defines specific evaluation protocols (Compositional Sampling, Restricted Sampling) to rigorously test an agent's ability to extract and recombine functional modules.
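The task grid and the compositional evaluation idea above can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the per-axis element names are placeholders I made up; only the 4×4×4×4 structure (256 tasks) and the idea of holding out specific element combinations come from the summary.

```python
from itertools import product

# Each CompoSuite task composes one element from each of 4 axes.
# Element names here are illustrative placeholders, not the real ones.
AXES = {
    "robot":     ["robot_A", "robot_B", "robot_C", "robot_D"],
    "object":    ["object_A", "object_B", "object_C", "object_D"],
    "obstacle":  ["obstacle_A", "obstacle_B", "obstacle_C", "obstacle_D"],
    "objective": ["objective_A", "objective_B", "objective_C", "objective_D"],
}

# Cartesian product of the 4 axes -> the full structured grid of tasks.
tasks = [dict(zip(AXES, combo)) for combo in product(*AXES.values())]
assert len(tasks) == 256  # 4^4 related tasks

# A compositional (zero-shot) split: hold out every task pairing one
# robot with one objective, even though both elements still appear in
# training through other combinations.
held_out = [t for t in tasks
            if t["robot"] == "robot_A" and t["objective"] == "objective_A"]
train = [t for t in tasks if t not in held_out]
print(len(train), len(held_out))  # 240 training tasks, 16 held-out tasks
```

An agent evaluated on the 16 held-out tasks has seen `robot_A` and `objective_A` separately but never together, which is exactly the recombination ability the evaluation protocols are designed to measure.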
Evaluation Highlights
  • Current offline RL methods (IQL, BC) achieve varying success on training tasks (up to 96% with expert data) but largely fail at zero-shot compositional generalization (often <20% success).
  • Compositional architectures (CP-IQL) outperform monolithic baselines on zero-shot tasks (e.g., +24% success rate on Expert-Warmstart split) but still struggle with Restricted Sampling.
  • Behavioral Cloning (BC) fails completely (0% success) when trained on 'Medium-Replay' data, while IQL maintains some performance, highlighting the difficulty of learning from noisy, multi-modal offline data.
Breakthrough Assessment
7/10
While not a new algorithm, the dataset fills a critical gap in offline RL by enabling rigorous study of compositionality. The baselines' poor zero-shot performance highlights a significant open challenge for the field.