SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

📝 Paper Summary

Robotic Manipulation, Imitation Learning, Reward Modeling

SARM introduces a hierarchical reward model that predicts task stages and fine-grained progress from video to filter and reweight noisy demonstrations for robust behavior cloning.

Core Problem

Robotic imitation learning for long-horizon, deformable object tasks struggles with inconsistent demonstration quality and the failure of simple frame-index labels to capture true progress.

Why it matters:

Large datasets often contain suboptimal or noisy trajectories from inexperienced operators, degrading policy performance
Standard progress metrics (like time elapsed) fail when task duration varies significantly, such as in folding clothes
Current Robot Behavior Models (RBMs) struggle to generalize beyond curated expert data in contact-rich settings

Concrete Example: In T-shirt folding, the flattening phase may take 10 seconds or 30 seconds depending on the initial crumple. A frame-based labeler would assign different progress values (e.g., 0.2 vs 0.8) to the exact same 'fully flattened' state based solely on time, confusing the policy.

Key Novelty

Stage-Aware Reward Modeling (SARM) + Reward-Aligned Behavior Cloning (RA-BC)

Decomposes reward prediction into two heads: a classifier for high-level semantic stages and a regressor for fine-grained progress within that stage
Derives consistent ground-truth labels from natural language subtask annotations rather than raw time indices
Uses the learned reward to reweight training data in Behavior Cloning, effectively filtering out non-progressing or noisy segments without needing explicit expert/non-expert labels
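
The label derivation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each stage's share of total progress is its average fraction of episode length across the annotated demonstrations, and that progress grows linearly inside a stage. The names `stage_priors` and `progress_label`, and the frame boundaries, are hypothetical.

```python
from typing import List, Tuple

Bounds = List[Tuple[int, int]]  # per-stage (start_frame, end_frame), contiguous


def stage_priors(episodes: List[Bounds]) -> List[float]:
    """Average fraction of episode length spent in each stage (assumed weighting)."""
    n_stages = len(episodes[0])
    totals = [0.0] * n_stages
    for bounds in episodes:
        length = bounds[-1][1] - bounds[0][0]
        for k, (start, end) in enumerate(bounds):
            totals[k] += (end - start) / length
    return [t / len(episodes) for t in totals]


def progress_label(frame: int, bounds: Bounds, priors: List[float]) -> float:
    """Progress in [0, 1]: priors of finished stages plus a linear share of the current one."""
    done = 0.0
    for (start, end), prior in zip(bounds, priors):
        if frame >= end:
            done += prior  # stage fully completed
        elif frame >= start:
            return done + prior * (frame - start) / (end - start)
    return min(done, 1.0)


# Two toy episodes with stages (flatten, fold); flattening is slow in the second.
EP_FAST: Bounds = [(0, 100), (100, 200)]
EP_SLOW: Bounds = [(0, 300), (300, 400)]

priors = stage_priors([EP_FAST, EP_SLOW])
label_fast = progress_label(100, EP_FAST, priors)  # shirt just flattened
label_slow = progress_label(300, EP_SLOW, priors)  # shirt just flattened
```

In this toy example the "fully flattened" state occurs at frame 100 of 200 in one episode and frame 300 of 400 in the other. Frame-index labels would assign 0.5 and 0.75 to that same state, while the stage-aware labels above both come out at 0.625.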

Architecture

The dual-head architecture of the SARM reward model.

Evaluation Highlights

83% success rate on real-world T-shirt folding from a flattened state (vs. 8% for vanilla Behavior Cloning)
67% success rate on real-world T-shirt folding from a crumpled state (vs. 0% for vanilla Behavior Cloning)
Outperforms VLM-based reward models (LIV, ReWiND) in correlation with human progress ranking on unseen test data

Breakthrough Assessment

8/10

Raises the success rate on crumpled T-shirt folding, a very difficult real-world task, from 0% to 67% by addressing data quality, a critical but often overlooked factor in scaling robot learning.

⚙️ Technical Details

Problem Definition

Setting: Vision-based Imitation Learning from heterogeneous, suboptimal demonstrations

Inputs: Sequence of RGB images (top, left wrist, right wrist views) and joint states

Outputs: Robot joint actions (for policy) or Progress score in [0,1] (for reward model)

Pipeline Flow

Visual Encoder (CLIP)
Multimodal Projector
Transformer Encoder
Dual Prediction Heads (Stage & Subtask)

System Modules

Visual Encoder

Extract visual features from video frames

Model or implementation: Frozen CLIP encoder

Transformer Encoder

Process temporal dependencies and cross-modal interactions

Model or implementation: Transformer

Stage Head (Prediction)

Classify the current high-level semantic stage of the task

Model or implementation: MLP classifier

Subtask Head (Prediction)

Regress fine-grained progress within the predicted stage

Model or implementation: MLP regressor (conditioned on stage info)

Novel Architectural Elements

Dual-head architecture explicitly separating discrete stage classification from continuous intra-stage progress regression
Hierarchical progress estimation where global progress is a composition of stage probability and local subtask progress
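
One way to read the hierarchical composition above is sketched below. It is an assumption-laden illustration rather than the paper's exact formula: it picks the most probable stage from the classifier logits, then adds the total width of all earlier stages to the regressor's within-stage progress scaled by the current stage's width. The function names and the `stage_widths` values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over stage logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_progress(stage_logits: List[float],
                    local_progress: float,
                    stage_widths: List[float]) -> float:
    """Compose global progress from a stage prediction and within-stage progress.

    stage_widths are assumed to sum to 1 and give each stage's share of the task.
    """
    probs = softmax(stage_logits)
    k = max(range(len(probs)), key=probs.__getitem__)  # most likely stage
    tau = min(max(local_progress, 0.0), 1.0)           # clamp regressor output
    return sum(stage_widths[:k]) + stage_widths[k] * tau


# Stage 1 is most likely; the regressor says that stage is half done.
example = global_progress([0.1, 2.0, -1.0], 0.5, [0.5, 0.25, 0.25])
```

A soft variant that averages over stage probabilities instead of taking the argmax is equally plausible; the hard selection is used here only because it is the simplest to follow.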

Modeling

Base Model: CLIP (Visual Backbone) + Transformer (Temporal Aggregation)

Training Method: Supervised Learning for Reward Model; Weighted BC for Policy

Objective Functions:

Purpose: Train the reward model to predict stage and progress.

Formally: Cross-entropy loss for stage classification + MSE loss for subtask progress regression against heuristically derived labels.

Purpose: Train the policy using weighted behavior cloning.

Formally: Weighted negative log-likelihood of actions, where weights w_i are derived from the reward model's predicted progress delta.
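
The two objectives above can be written out per sample as in the sketch below. It is illustrative only: the relative weighting of the two reward-model terms (`mse_weight`) and the normalization of the policy loss by the sum of weights are assumptions, since the summary does not specify them, and the weights `w_i` are taken as given inputs.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def reward_model_loss(stage_logits: List[float], stage_label: int,
                      progress_pred: float, progress_target: float,
                      mse_weight: float = 1.0) -> float:
    """Stage classification loss plus squared error on within-stage progress."""
    ce = cross_entropy(stage_logits, stage_label)
    mse = (progress_pred - progress_target) ** 2
    return ce + mse_weight * mse  # mse_weight is an assumed knob


def weighted_bc_loss(neg_log_likelihoods: List[float],
                     weights: List[float],
                     eps: float = 1e-8) -> float:
    """Weighted mean of per-sample action NLLs; zero-weight samples are ignored."""
    weighted = sum(w * nll for w, nll in zip(weights, neg_log_likelihoods))
    return weighted / (sum(weights) + eps)
```

With weights of 1 for every sample, `weighted_bc_loss` reduces to the ordinary mean NLL of vanilla behavior cloning, which makes the effect of the reweighting easy to isolate.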

Training Data:

Real-world robot data (ALOHA hardware)
Annotated with start/end frames of semantic subtasks
Mistake trajectories excluded from reward model training but potentially present in large-scale policy data

Key Hyperparameters:

clip_epsilon: Not reported in the paper
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LIV/ReWiND: SARM uses explicit stage supervision and treats progress as linear within each stage, whereas LIV/ReWiND rely on global embedding distances, which fail in cyclic or long tasks
vs. Vanilla BC: SARM reweights data based on estimated progress, filtering out pauses and noise
vs. DWBC (Discriminator-Weighted BC) [not cited in paper]: DWBC trains a discriminator between expert/non-expert data; SARM learns a continuous progress metric from annotated structure, not requiring explicit 'bad' datasets

Limitations

Relies on manual annotation of subtask boundaries for the reward model training set
Assumes a fixed sequence of subtasks (linear topology), potentially limiting flexibility for tasks with multiple valid orderings
Computationally more intensive than simple frame-based heuristics due to the video-based reward model inference

Reproducibility

Code: https://qianzhong-chen.github.io/sarm.github.io/

A project website is provided. The paper's code availability statement mentions public availability, but specific hyperparameters (learning rate, batch size) are missing from the text. Dataset details (subtask protocols) are well documented.

📊 Experiments & Results

Evaluation Setup

Real-world robotic manipulation with ALOHA hardware.

Benchmarks:

T-Shirt Folding (Flattened): deformable object manipulation [New]
T-Shirt Folding (Crumpled): deformable object manipulation, long-horizon [New]

Metrics:

Success Rate (Real Robot)
Kendall's Tau (Correlation with human ranking)
Statistical methodology: Not explicitly reported in the paper
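
Kendall's tau, listed above, measures how well two rankings agree. The sketch below computes the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant the paper uses is not stated in this summary, so treat the choice as an assumption. The example scores are made up.

```python
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical human progress ranking vs. model-predicted progress scores.
human_rank = [1, 2, 3, 4]
model_score = [0.1, 0.6, 0.4, 0.9]  # one adjacent pair is swapped
tau = kendall_tau(human_rank, model_score)
```

Here five of the six pairs are ordered consistently and one is reversed, so tau is (5 - 1) / 6, roughly 0.67. A value of 1 means the model ranks every pair the same way the human does.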

Key Results

Real-world robot policy performance comparing SARM (RA-BC) against standard Behavior Cloning (BC) and other weighting schemes.

Benchmark	Metric	Baseline	This Paper	Δ
T-Shirt Folding (Flattened)	Success Rate (%)	8	83	+75
T-Shirt Folding (Crumpled)	Success Rate (%)	0	67	+67
T-Shirt Folding (Flattened)	Success Rate (%)	0	83	+83

Experiment Figures

Qualitative visualization of reward traces for SARM vs. baselines (LIV, ReWiND, Time-Index) on a T-shirt folding trajectory.

Main Takeaways

Reward-aligned weighting is critical for learning from diverse/noisy demonstrations in long-horizon tasks.
Standard Behavior Cloning collapses (0-8% success) on the T-shirt folding dataset, likely due to multimodal distributions and pauses/mistakes in the data.
SARM provides a smoother, more monotonic progress signal than the baseline reward models (LIV, ReWiND), which exhibit noise and local minima.
The decomposition into stages and subtask progress is essential for handling variable-duration tasks where simple time-indexing fails.

📚 Prerequisite Knowledge

Prerequisites

Imitation Learning / Behavior Cloning
Robot Manipulation (specifically deformable objects)
Vision-Language Models (CLIP encoders)
Basic Neural Network architectures (Transformers, MLPs)

Key Terms

SARM: Stage-Aware Reward Modeling—the proposed framework for estimating task progress using hierarchical stage and subtask predictions

RA-BC: Reward-Aligned Behavior Cloning—a training method that weights imitation learning samples based on their estimated progress/reward

RBM: Robot Behavior Models—general-purpose policies that integrate perception and control for robotic tasks

CLIP: Contrastive Language-Image Pre-training—a model used here to encode visual observations into embeddings

Deformable Object: Objects like fabric or clothes that change shape when manipulated, making state estimation and planning difficult

Behavior Cloning (BC): A supervised learning approach where a policy is trained to minimize the error between its predicted actions and expert demonstrations

Welford's Algorithm: A numerically stable method for computing running mean and variance, used here to normalize reward weights online

VLM: Vision-Language Model—models that process both images and text, often used as baselines for reward estimation

Subtask: A semantic segment of a long-horizon task (e.g., 'grasp left sleeve'), used to ground progress labels
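
The Welford's Algorithm entry above refers to a standard single-pass update for mean and variance. A minimal sketch is given below, followed by one hypothetical way such statistics could turn a progress delta into a weight in [0, 1]. The `progress_weight` mapping (a linear rescale over mean ± 2 standard deviations) is an assumption made for illustration and is not taken from the paper.

```python
import math


class RunningStats:
    """Welford's online algorithm for mean and (population) variance."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # uses the already-updated mean

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 0 else 0.0

    @property
    def std(self) -> float:
        return math.sqrt(self.variance)


def progress_weight(delta: float, stats: RunningStats, eps: float = 1e-6) -> float:
    """Hypothetical weight: rescale delta from [mean - 2 std, mean + 2 std] to [0, 1]."""
    low = stats.mean - 2.0 * stats.std
    width = 4.0 * stats.std + eps
    return min(max((delta - low) / width, 0.0), 1.0)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # one pass, no stored history
```

For this sequence the running mean is 5 and the population standard deviation is 2, so a delta equal to the mean maps to a weight of about 0.5, while strongly negative deltas are clipped to 0.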