← Back to Paper List

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Wenqi Zhang, Mengna Wang, Gangao Liu, Hui Xu, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
College of Computer Science and Technology, Zhejiang University, Institute of Software, Chinese Academy of Sciences, Alibaba Group
arXiv.org (2025)
MM Agent Reasoning RL

📝 Paper Summary

Embodied AI Visual Reasoning Agentic AI
Embodied-Reasoner adapts o1-style deep thinking to physical agents by training a VLM on synthetic observation-thought-action trajectories to enable spatial reasoning and self-correction.
Core Problem
Current reasoning models excel at static math/code but fail at embodied tasks requiring continuous visual feedback, spatial understanding, and self-correction over long horizons.
Why it matters:
  • Standard VLMs (Vision-Language Models) struggle to process lengthy, interleaved image-action histories, leading to repetitive or inconsistent behaviors in physical environments
  • Mathematical reasoning models rely on logical deduction, whereas embodied agents need distinct capabilities like spatial reasoning, temporal recall, and commonsense physical inference
  • Even advanced models like OpenAI o3-mini frequently fail to exhibit robust reasoning in interactive tasks, getting stuck in loops or making illogical navigational decisions
Concrete Example: When searching for a hidden keychain, a standard agent might repeatedly open the same empty drawer or search random locations. In contrast, Embodied-Reasoner explicitly generates thoughts to 'recall relevant cues from previous attempts' and infer that if the keychain isn't on the table, it might be in the drawer, avoiding redundant actions.
Key Novelty
Embodied Chain-of-Thought (Observation-Thought-Action)
  • Integrates explicit 'thinking' steps (analysis, planning, reflection) between visual observations and physical actions, similar to OpenAI o1 but adapted for physical interaction
  • Uses a synthetic data engine to generate training trajectories that include not just actions, but the *internal monologue* explaining why an action was chosen (e.g., 'I see a fridge, I should check it for the egg')
  • Implements a three-stage training pipeline: Imitation Learning for basic skills, Rejection Sampling for exploration, and Reflection Tuning to learn self-correction from failure
Architecture
Architecture Figure Figure 2 & 3
Overview of the data synthesis engine and the three-stage training pipeline.
Evaluation Highlights
  • Outperforms OpenAI o1 by +9% in success rate across embodied tasks in the AI2-THOR simulator
  • Surpasses OpenAI o3-mini by +24% in success rate, demonstrating superior handling of visual-interactive contexts
  • Achieves +39.9% higher success rate on complex composite tasks (multi-step transportation) compared to the second-best model, showing strong long-horizon planning
Breakthrough Assessment
8/10
successfully transfers the 'slow thinking' paradigm (o1-style) to embodied AI, addressing a major gap in how reasoning models handle continuous, visual environments.
×