
SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Fred Shentu, Philipp Wu
Stanford University, University of California, Berkeley
arXiv.org (2025)
MM RL

📝 Paper Summary

Robotic Manipulation · Imitation Learning · Reward Modeling
SARM introduces a hierarchical reward model that predicts task stages and fine-grained progress from video to filter and reweight noisy demonstrations for robust behavior cloning.
Core Problem
Robotic imitation learning for long-horizon, deformable object tasks struggles with inconsistent demonstration quality and the failure of simple frame-index labels to capture true progress.
Why it matters:
  • Large datasets often contain suboptimal or noisy trajectories from inexperienced operators, degrading policy performance
  • Standard progress metrics (like time elapsed) fail when task duration varies significantly, such as in folding clothes
  • Current Robot Behavior Models (RBMs) struggle to generalize beyond curated expert data in contact-rich settings
Concrete Example: In T-shirt folding, the flattening phase may take 10 seconds or 30 seconds depending on the initial crumple. A frame-based labeler would assign different progress values (e.g., 0.2 vs 0.8) to the exact same 'fully flattened' state based solely on time, confusing the policy.
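The labeling mismatch in this example can be made concrete with a small sketch. It compares a naive frame-index label against a stage-based label. The function names, the stage boundaries, and the fixed per-stage progress shares (0.6 for flattening, 0.4 for folding) are illustrative assumptions, not values or code from the paper; linear interpolation within a stage is also an assumption.

```python
from bisect import bisect_right


def frame_index_progress(t: int, num_frames: int) -> float:
    """Naive label: fraction of the episode elapsed at frame t."""
    return t / num_frames


def stage_aware_progress(t: int, stage_ends: list[int], stage_shares: list[float]) -> float:
    """Stage-based label (hypothetical scheme).

    stage_ends[k] is the frame at which stage k finishes in this episode.
    stage_shares[k] is the fixed share of overall progress assigned to stage k,
    identical for every episode. Progress is interpolated linearly inside a stage.
    """
    k = bisect_right(stage_ends, t)  # index of the stage that frame t belongs to
    if k == len(stage_ends):
        return 1.0  # at or past the end of the final stage
    start = stage_ends[k - 1] if k > 0 else 0
    within = (t - start) / (stage_ends[k] - start)
    return sum(stage_shares[:k]) + stage_shares[k] * within


# Two episodes: flattening takes 10 frames in A and 30 frames in B; folding takes 10 in both.
shares = [0.6, 0.4]
episode_a = [10, 20]
episode_b = [30, 40]

naive_a = frame_index_progress(10, 20)
naive_b = frame_index_progress(30, 40)
staged_a = stage_aware_progress(10, episode_a, shares)
staged_b = stage_aware_progress(30, episode_b, shares)
```

At the moment flattening completes, the frame-index label gives 0.5 for episode A and 0.75 for episode B, while the stage-based label gives 0.6 for both, so the same physical state receives the same target.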
Key Novelty
Stage-Aware Reward Modeling (SARM) + Reward-Aligned Behavior Cloning (RA-BC)
  • Decomposes reward prediction into two heads: a classifier for high-level semantic stages and a regressor for fine-grained progress within that stage
  • Derives consistent ground-truth labels from natural language subtask annotations rather than raw time indices
  • Uses the learned reward to reweight training data in Behavior Cloning, effectively filtering out non-progressing or noisy segments without needing explicit expert/non-expert labels
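The reweighting idea in the last bullet can be sketched as follows. This is a minimal illustration, assuming each training sample is weighted by how much predicted progress increases over a short horizon; the function names, the linear ramp, and the thresholds `low` and `high` are hypothetical choices and are not taken from the paper.

```python
def progress_weights(progress: list[float], horizon: int,
                     low: float = 0.0, high: float = 0.05) -> list[float]:
    """Map predicted progress gain over `horizon` steps to a weight in [0, 1].

    Gains at or below `low` get weight 0 (non-progressing segments are dropped);
    gains at or above `high` get weight 1; values in between ramp linearly.
    """
    weights = []
    for t in range(len(progress) - horizon):
        delta = progress[t + horizon] - progress[t]
        w = (delta - low) / (high - low)
        weights.append(min(1.0, max(0.0, w)))
    return weights


def weighted_bc_loss(losses: list[float], weights: list[float], eps: float = 1e-8) -> float:
    """Weighted mean of per-sample behavior cloning losses."""
    total = sum(w * l for w, l in zip(weights, losses))
    return total / (sum(weights) + eps)


# Predicted progress for one trajectory: it stalls in the middle.
predicted = [0.0, 0.1, 0.1, 0.1, 0.2]
w = progress_weights(predicted, horizon=1)

# Hypothetical per-sample imitation losses for the same four steps.
per_sample_loss = [1.0, 5.0, 5.0, 3.0]
loss = weighted_bc_loss(per_sample_loss, w)
```

In this toy trajectory the two stalled steps receive zero weight, so only the first and last samples contribute to the loss. No expert or non-expert label is needed; the weights come entirely from the reward model's predictions.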
Architecture
Figure 4: The dual-head architecture of the SARM reward model.
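One plausible way to turn the two head outputs into a single scalar progress value is sketched below. It assumes the stage head outputs a probability per stage, the progress head outputs a within-stage fraction in [0, 1], and each stage owns a fixed share of overall progress. The function name `combine_heads` and the combination rule are assumptions for illustration; the summary does not specify how the paper merges the two predictions.

```python
def combine_heads(stage_probs: list[float], within_stage: float,
                  stage_shares: list[float]) -> float:
    """Combine a stage classification and a within-stage regression into one value.

    stage_probs: classifier output, one probability per stage.
    within_stage: regressor output, clipped here to [0, 1].
    stage_shares: hypothetical fixed share of total progress per stage (sums to 1).
    """
    # Pick the most likely stage.
    k = max(range(len(stage_probs)), key=stage_probs.__getitem__)
    tau = min(1.0, max(0.0, within_stage))
    # Progress completed by earlier stages plus partial progress in the current one.
    return sum(stage_shares[:k]) + stage_shares[k] * tau


# Three stages; the classifier favors the second one, which is half done.
value = combine_heads([0.1, 0.7, 0.2], within_stage=0.5, stage_shares=[0.3, 0.5, 0.2])
```

Here the first stage contributes its full share and the second stage contributes half of its share, so the combined estimate is about 0.55.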
Evaluation Highlights
  • 83% success rate on real-world T-shirt folding from a flattened state (vs. 8% for vanilla Behavior Cloning)
  • 67% success rate on real-world T-shirt folding from a crumpled state (vs. 0% for vanilla Behavior Cloning)
  • Outperforms VLM-based reward models (LIV, ReWiND) in correlation with human progress ranking on unseen test data
Breakthrough Assessment
8/10
Raises success on a difficult real-world task (T-shirt folding from a crumpled state) from 0% with vanilla Behavior Cloning to 67% by addressing demonstration data quality, a critical but often overlooked factor in scaling robot learning.