_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
MBRL: Model-Based Reinforcement Learning—learning a model of the environment's dynamics to simulate and optimize behavior.
DreamerV3: A state-of-the-art model-based RL algorithm that learns a latent world model to generate synthetic experience for policy training.
RSSM: Recurrent State-Space Model—a specific neural network architecture used in Dreamer to model dynamics using both deterministic and stochastic components.
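To make the deterministic/stochastic split concrete, here is a toy, purely illustrative sketch. All names and the update rule are invented for illustration; the actual RSSM uses a GRU for the deterministic path and learned Gaussian parameters for the stochastic latent.

```python
import math
import random

def rssm_step(h, z, action, w=0.9):
    """Toy RSSM-style state update (illustrative only): a deterministic
    recurrent path h plus a stochastic latent z sampled around it."""
    # Deterministic component: recurrent blend of previous state, latent, action
    h_next = [math.tanh(w * hi + zi + action) for hi, zi in zip(h, z)]
    # Stochastic component: sample around a mean predicted from h_next
    z_next = [hn + random.gauss(0.0, 0.1) for hn in h_next]
    return h_next, z_next
```

The key design point this mimics: the deterministic path carries information reliably across time, while the stochastic path lets the model represent uncertainty about the environment.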
Scaffolder: The proposed method that uses privileged sensors to improve world models, critics, and exploration for a target policy.
Privileged Information: Data available only during training (like ground-truth states or extra camera views) but not during deployment.
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly observe the full state of the world.
Transdecoder: A neural component in Scaffolder that maps privileged latent states to predicted target observations, enabling the target policy to run inside the scaffolded world model.
S3 Suite: Sensory Scaffolding Suite—a new benchmark of 10 simulated robotic tasks designed to evaluate agents with limited test-time sensors.
TD-lambda: Temporal Difference lambda, also written TD(λ); a method for estimating the value of a state by blending n-step returns at multiple horizons, with weights that decay by a factor of λ per additional step.
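As an illustrative sketch (not from the paper), the TD(λ) return can be computed with a backward recursion over a trajectory; here `values` holds bootstrapped state-value estimates, one entry longer than `rewards`:

```python
def td_lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion for the TD(lambda) return:
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns
```

Setting `lam=0` recovers one-step TD targets (low variance, more bias), while `lam=1` recovers full Monte Carlo returns (high variance, no bootstrap bias); intermediate values trade the two off.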
Critic: A neural network that estimates the value (expected discounted sum of future rewards) of a state or action.
World Model: A learned simulator that predicts future states and rewards given current states and actions.
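A world model enables "imagination": rolling a policy forward entirely inside the learned simulator, with no environment interaction. A minimal sketch of that loop, using an invented interface where the model maps (state, action) to (predicted next state, predicted reward):

```python
def imagine(world_model, policy, state, horizon):
    """Roll a policy forward inside a learned world model (sketch;
    the world_model/policy interface is invented for illustration)."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        # Predicted transition and reward -- no real environment is queried
        state, reward = world_model(state, action)
        trajectory.append((state, action, reward))
    return trajectory
```

Dreamer-style agents train the critic and policy on such imagined trajectories, which is far cheaper than collecting real experience.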
Latent State: A compressed internal representation of the environment state learned by the neural network.