
Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, P. Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
arXiv.org (2025)
RL, MM, Benchmark

📝 Paper Summary

Robotic Manipulation, Reward Modeling, Reinforcement Learning
Robo-Dopamine introduces a General Reward Model (GRM), a general-purpose, multi-view process reward model, together with a theoretically grounded reward shaping framework (Dopamine-RL), to enable efficient reinforcement learning for high-precision robotic manipulation.
Core Problem
Applying RL to real-world robotics is hindered by ineffective reward functions: sparse rewards make exploration difficult, while handcrafted dense rewards are unscalable. Existing learned reward models lack step-aware understanding, suffer from single-view occlusion, and often induce 'semantic traps' that alter the optimal policy.
Why it matters:
  • Sparse rewards in long-horizon, contact-rich tasks make exploration prohibitively difficult for RL agents
  • Current learned Process Reward Models (PRMs) rely on single-view perception, failing when occlusions obscure fine-grained progress
  • Naive integration of dense rewards often changes the optimal policy (the 'semantic trap'), causing agents to maximize proxy rewards rather than complete the task
Concrete Example: In a peg-insertion task, a wrist-level view is essential for seeing whether the peg is aligned with the hole, and a model limited to a single view may not capture it. Furthermore, a naive dense reward might encourage the robot to hover near the hole and accumulate 'progress' points without ever inserting the peg, so the task is never completed.
Key Novelty
General Reward Model (GRM) with Policy-Invariant Reward Shaping
  • Trains a General Reward Model (GRM) on more than 3,400 hours of multi-view data to predict 'hops' (relative progress) between states, fusing incremental, forward-anchored, and backward-anchored predictions
  • Introduces Dopamine-RL, which shapes rewards using the GRM's output as a potential function, theoretically guaranteeing that the dense rewards guide exploration without changing the optimal policy (avoiding the semantic trap)
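A minimal sketch of the shaping idea, assuming the standard potential-based form r' = r + γΦ(s') − Φ(s), with the potential Φ standing in for a progress score such as the GRM's output. The function names, the toy progress values, and the two rollouts are hypothetical and are not taken from the paper. The sketch contrasts a naive bonus (add the progress score at every step) with potential-based shaping on the hovering scenario from the concrete example.

```python
from typing import Sequence


def naive_return(phi: Sequence[float], env_rewards: Sequence[float]) -> float:
    """Undiscounted return when the progress score of each new state is added as a bonus."""
    return sum(r + phi[t + 1] for t, r in enumerate(env_rewards))


def shaped_return(
    phi: Sequence[float], env_rewards: Sequence[float], gamma: float = 1.0
) -> float:
    """Discounted return under potential-based shaping: r + gamma * phi(s') - phi(s)."""
    total = 0.0
    for t, r in enumerate(env_rewards):
        shaped = r + gamma * phi[t + 1] - phi[t]
        total += (gamma ** t) * shaped
    return total


# Hypothetical progress scores (potentials) along two rollouts.
# "hover": reaches 0.9 progress, then lingers near the hole for 20 more steps.
hover_phi = [0.0, 0.5, 0.9] + [0.9] * 20
hover_env = [0.0] * (len(hover_phi) - 1)  # the sparse task reward never fires

# "finish": inserts the peg and receives the sparse task reward on the last step.
finish_phi = [0.0, 0.5, 0.9, 1.0]
finish_env = [0.0, 0.0, 1.0]

naive_hover = naive_return(hover_phi, hover_env)
naive_finish = naive_return(finish_phi, finish_env)
shaped_hover = shaped_return(hover_phi, hover_env)
shaped_finish = shaped_return(finish_phi, finish_env)

print("naive bonus:   hover", naive_hover, "finish", naive_finish)
print("shaped reward: hover", shaped_hover, "finish", shaped_finish)
```

Under the naive bonus the hovering rollout collects more return than the rollout that finishes the task. Under shaping with gamma = 1 the potential terms telescope, so each rollout's return is its task reward plus the difference between its final and initial potential, and lingering adds nothing.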
Evaluation Highlights
  • GRM achieves 92.8% accuracy in progress assessment and a Value-Order Consistency (VOC) score of 0.953
  • One-shot adaptation of GRM enables a policy to improve from a near-zero to a 95% success rate with only 150 online rollouts (approximately 1 hour of real robot interaction)
  • Generalizes to unseen layouts, backgrounds, and object variations across 10 simulation and 8 real-world tasks
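The summary does not define how the VOC score is computed. One plausible reading, stated here as an assumption, is a rank correlation between the predicted progress values for a trajectory's frames and the frames' chronological order, so that 1.0 means the predictions increase perfectly with time. The sketch below implements a Spearman-style rank correlation in plain Python; the function names are hypothetical and the paper's exact definition may differ.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with the entry at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def value_order_correlation(predicted_progress: Sequence[float]) -> float:
    """Rank correlation between predicted progress and chronological frame order."""
    n = len(predicted_progress)
    if n < 2:
        raise ValueError("need at least two frames")
    x = average_ranks(predicted_progress)
    y = [float(t + 1) for t in range(n)]  # chronological ranks 1..n
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0:
        return 0.0  # constant predictions carry no ordering information
    return cov / (var_x * var_y) ** 0.5
```

For example, `value_order_correlation([0.1, 0.2, 0.3, 0.4])` yields a perfect positive score, while a strictly decreasing sequence yields a perfect negative one.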
Breakthrough Assessment
9/10
Significant advance in RL for robotics. The combination of a large-scale general reward model with a theoretically sound shaping mechanism that prevents reward hacking (semantic trap) solves two major bottlenecks simultaneously.