EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

📝 Paper Summary

Embodied AI, Multi-modal Large Language Models (MLLMs), Agent Evaluation

EmbodiedBench evaluates MLLM-based agents across 1,128 diverse embodied tasks, revealing that while models succeed at high-level planning, they struggle significantly with low-level manipulation and 3D spatial reasoning.

Core Problem

Existing benchmarks for embodied agents either focus only on high-level planning or specific modules, failing to evaluate how Multi-modal LLMs handle low-level vision-driven control (navigation, manipulation) and fine-grained capabilities.

Why it matters:

Language-centric agents overlook the critical role of vision in low-level control, creating a blind spot in current evaluations.
Without standardized evaluation for low-level actions, it is unclear if MLLMs can act as direct robotic controllers or only as high-level planners.
Current benchmarks lack fine-grained diagnosis of specific failure modes like spatial awareness or long-horizon planning.

Concrete Example: In a manipulation task like 'pick up the blue box', a high-level planner simply outputs 'pick(blue_box)', but a real robot needs a 7-dimensional vector [x, y, z, roll, pitch, yaw, gripper]. Current MLLMs often fail to generate these precise continuous values, even when given visual inputs.
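
To make the contrast concrete, here is a minimal Python sketch of the two action formats. The class name `EndEffectorAction`, the gripper convention, and all numeric values are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class EndEffectorAction:
    """Hypothetical 7-D low-level command: position, Euler orientation, gripper."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float
    gripper: float  # assumed convention: 1.0 = open, 0.0 = closed


# High-level planner output: one symbolic skill call.
high_level_action = "pick(blue_box)"

# Low-level controller output: every component has to be numerically right.
# The numbers below are made up for illustration.
low_level_action = EndEffectorAction(0.42, -0.10, 0.15, 0.0, 3.14, 0.0, 0.0)

vector = list(astuple(low_level_action))
print(high_level_action, "->", len(vector), "numbers:", vector)
```

A symbolic action is either valid or not, whereas the 7-D vector can be wrong in any of its components, which is one way to see why the low-level setting is harder.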

Key Novelty

Multi-level Embodied Evaluation Framework

Unifies evaluation across hierarchical action levels: from high-level semantic actions (e.g., 'slice apple') to low-level atomic actions (e.g., continuous joint control).
Introduces capability-oriented subsets that isolate 6 skills (e.g., Spatial Awareness, Long-Horizon) instead of reporting only overall success rates.
Deploys a standardized MLLM agent pipeline that integrates ego-centric vision, history, and feedback to fairly compare 24 distinct models.

Architecture

The unified MLLM agent pipeline for EmbodiedBench.

Evaluation Highlights

Proprietary models dominate but struggle: GPT-4o achieves the highest average success rate of 28.9%, significantly outperforming open-source models but still failing >70% of tasks.
Low-level manipulation is the hardest domain: GPT-4o scores only 20.3% on EB-Manipulation, compared to 52.3% on high-level EB-ALFRED tasks.
Vision is critical for low-level tasks: Removing visual input drops performance by 40%-70% in navigation/manipulation, whereas high-level planning is minimally affected.

Breakthrough Assessment

8/10

A comprehensive, much-needed benchmark that exposes the severe limitations of current MLLMs in actual robotic control (low-level actions) versus abstract planning.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) augmented with language instructions

Inputs: Language instruction L, visual observation sequence I_t, interaction history h_t

Outputs: Action a_t (either high-level semantic skill or low-level kinematic command)

Pipeline Flow

Environment Observation (Visual + Textual)
Prompt Construction (History + In-Context Examples)
MLLM Inference (Reasoning + Planning)
Action Parsing & Execution
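
The four stages above can be sketched as a simple loop. This is a toy illustration under stated assumptions: `ToyEnv`, `stub_model`, the ';'-separated plan format, and the "replan on first failure" rule are all hypothetical stand-ins, and text observations replace images so the example runs without a simulator.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical stand-in for a simulator: succeed by 'open' then 'grab'."""

    def __init__(self) -> None:
        self.opened = False
        self.done = False

    def observe(self) -> str:
        # A real environment would return an ego-centric image here.
        return "drawer is open" if self.opened else "drawer is closed"

    def step(self, action: str) -> Tuple[str, bool]:
        """Execute one action and return (feedback, success)."""
        if action == "open":
            self.opened = True
            return "opened drawer", True
        if action == "grab" and self.opened:
            self.done = True
            return "grabbed object", True
        return f"'{action}' failed", False


def stub_model(prompt: str) -> str:
    """Stand-in for an MLLM call: returns a ';'-separated multi-step plan."""
    return "grab" if "drawer is open" in prompt else "open;grab"


def run_episode(env: ToyEnv, model: Callable[[str], str],
                instruction: str, max_turns: int = 5) -> List[str]:
    history: List[str] = []
    for _ in range(max_turns):
        # Stages 1-2: observe, then build the prompt from task, observation, history.
        prompt = (f"Task: {instruction}\n"
                  f"Observation: {env.observe()}\n"
                  f"History: {history}")
        # Stage 3: one inference turn may yield several actions.
        plan = [a.strip() for a in model(prompt).split(";") if a.strip()]
        # Stage 4: execute the plan; stop and replan on the first failure.
        for action in plan:
            feedback, ok = env.step(action)
            history.append(f"{action} -> {feedback}")
            if not ok or env.done:
                break
        if env.done:
            break
    return history


env = ToyEnv()
log = run_episode(env, stub_model, "take the object out of the drawer")
print(log)
```

Feeding the execution feedback back through `history` is what lets the next inference turn react to a failed action.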

System Modules

Prompt Constructor

Aggregates current image, instruction, history, and valid action specifications into a structured prompt

Model or implementation: Rule-based
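
A rule-based prompt constructor of this kind could look roughly like the sketch below. The function name `build_prompt`, the section titles, the image placeholder string, and the JSON output schema are assumptions made for illustration; the benchmark's actual prompt template may differ.

```python
from typing import Dict, List


def build_prompt(instruction: str, image_ref: str, history: List[str],
                 valid_actions: Dict[int, str]) -> str:
    """Assemble a structured text prompt; the section layout is an assumption."""
    action_lines = [f"{idx}: {name}" for idx, name in sorted(valid_actions.items())]
    history_lines = history if history else ["(none)"]
    output_spec = ('JSON: {"visual_description": str, '
                   '"reasoning": str, "plan": [action ids]}')
    sections = [
        ("Instruction", [instruction]),
        ("Current image", [image_ref]),
        ("History", history_lines),
        ("Valid actions", action_lines),
        ("Output format", [output_spec]),
    ]
    return "\n\n".join(f"## {title}\n" + "\n".join(lines)
                       for title, lines in sections)


prompt = build_prompt(
    instruction="Put the apple in the fridge",
    image_ref="<image: current ego-centric frame>",
    history=["step 1: find apple -> success"],
    valid_actions={0: "find apple", 1: "pick up apple", 2: "open fridge"},
)
print(prompt)
```

Listing the valid actions with explicit ids keeps the model's output easy to validate with a deterministic parser.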

Vision-Driven Planner

Generates a structured plan including visual description, reasoning, and executable actions

Model or implementation: Evaluated MLLM (e.g., GPT-4o, Llama-3.2-Vision)

Action Parser

Converts MLLM text output into environment-compatible commands

Model or implementation: Rule-based deterministic parser
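
As a rough illustration of deterministic parsing, the sketch below extracts a JSON object from free-form model text and maps action ids to commands. The JSON schema (a `plan` list of integer ids) and the "stop at the first invalid id" rule are assumptions for this example, not the benchmark's documented behaviour.

```python
import json
import re
from typing import Dict, List


def parse_plan(reply: str, valid_actions: Dict[int, str]) -> List[str]:
    """Extract a JSON object from the reply and map action ids to commands.

    Returns an empty list when nothing usable is found, so the caller can
    count the turn as invalid instead of crashing.
    """
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    plan = payload.get("plan")
    if not isinstance(plan, list):
        return []
    commands: List[str] = []
    for action_id in plan:
        # Stop at the first id outside the valid action space: later steps
        # were planned on top of an impossible one.
        is_int = isinstance(action_id, int) and not isinstance(action_id, bool)
        if not is_int or action_id not in valid_actions:
            break
        commands.append(valid_actions[action_id])
    return commands


valid = {0: "find apple", 1: "pick up apple", 2: "open fridge"}
reply = ('Sure.\n{"visual_description": "an apple on the table", '
         '"reasoning": "pick it first", "plan": [0, 1, 7]}')
print(parse_plan(reply, valid))
```

Because the parser never guesses, malformed replies surface as empty plans that can be logged as a distinct failure mode.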

Novel Architectural Elements

Unified agent framework supporting both high-level semantic actions and low-level continuous control via discretization and YOLO augmentation
Multi-step planning capability within a single inference turn to optimize for low-level control sequences
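
One common way to expose continuous control to a text-generating model is uniform binning: each continuous component is mapped to an integer bin, and the chosen bin is mapped back to a value at execution time. The sketch below shows that generic technique; the bin count, the value ranges, and the function names are assumptions, and the benchmark's exact discretization scheme may differ.

```python
import math
from typing import List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]


def discretize(values: Sequence[float], bounds: Bounds, num_bins: int) -> List[int]:
    """Map each continuous value to a bin index in [0, num_bins - 1]."""
    indices = []
    for value, (low, high) in zip(values, bounds):
        clipped = min(max(value, low), high)  # keep out-of-range values valid
        fraction = (clipped - low) / (high - low)
        indices.append(min(int(fraction * num_bins), num_bins - 1))
    return indices


def undiscretize(indices: Sequence[int], bounds: Bounds, num_bins: int) -> List[float]:
    """Map each bin index back to the centre of its bin."""
    return [low + (idx + 0.5) * (high - low) / num_bins
            for idx, (low, high) in zip(indices, bounds)]


# Assumed ranges: metres for position, radians for orientation, [0, 1] for gripper.
BOUNDS = [(-1.0, 1.0)] * 3 + [(-math.pi, math.pi)] * 3 + [(0.0, 1.0)]
NUM_BINS = 100

action = [0.425, -0.103, 0.157, 0.0, 1.5, -0.7, 1.0]
tokens = discretize(action, BOUNDS, NUM_BINS)
recovered = undiscretize(tokens, BOUNDS, NUM_BINS)
print(tokens)
print(recovered)
```

The round trip loses at most half a bin width per component, so the bin count trades output vocabulary size against control precision.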

Modeling

Base Model: Various MLLMs evaluated (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet, Llama-3.2-Vision, Qwen2.5-VL)

Training Method: Zero-shot or Few-shot In-Context Learning (Evaluation only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. VisualAgentBench: EmbodiedBench adds low-level navigation and manipulation tasks, enabling evaluation of kinematic control, not just semantic planning
vs. ALFRED/Habitat standard evals: EmbodiedBench standardizes the agent pipeline to evaluate foundation models (MLLMs) directly, rather than training specialized policies
vs. ManiSkill [not cited in paper]: ManiSkill focuses on RL training for manipulation; EmbodiedBench focuses on zero/few-shot evaluation of general-purpose MLLMs

Limitations

Current MLLMs struggle significantly with multi-image history, forcing the agent to rely primarily on the current frame.
Low-level manipulation performance is still very low (<30%), suggesting MLLMs are not yet ready for direct robotic control without auxiliary policies.
Evaluation costs for proprietary models on 1,128 tasks with long horizons can be high.

Reproducibility

Code: https://embodiedbench.github.io

Code and dataset are available at https://embodiedbench.github.io. The benchmark relies on existing simulators (AI2-THOR, Habitat), which are publicly available. Results for proprietary models (GPT-4o, Gemini) may not be exactly reproducible because the underlying APIs change over time.

📊 Experiments & Results

Evaluation Setup

Zero/Few-shot evaluation of pre-trained MLLMs acting as embodied agents in 4 simulated environments.

Benchmarks:

EB-ALFRED: high-level household tasks (AI2-THOR) [New]
EB-Habitat: high-level rearrangement tasks (Habitat) [New]
EB-Navigation: low-level navigation (AI2-THOR) [New]
EB-Manipulation: low-level robotic arm control [New]

Metrics:

Success Rate (SR)
Goal Condition Success (GC) - for Habitat
Success weighted by Path Length (SPL) - for Navigation
Statistical methodology: Not explicitly reported in the paper
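
For reference, Success Rate and SPL can be computed as in the sketch below. SPL follows the standard definition from the embodied-navigation literature, the mean over episodes of S_i * l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the agent's path length; that the benchmark uses exactly this form is an assumption. The episode data is made up.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        agent_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length; assumes shortest_lengths are positive."""
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if success:
            # A successful but wasteful path earns partial credit.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Three made-up episodes: optimal success, detour success, failure.
successes = [True, True, False]
shortest = [2.0, 4.0, 3.0]
taken = [2.0, 8.0, 5.0]

print(success_rate(successes))
print(spl(successes, shortest, taken))
```

In this toy data the second episode succeeds with a path twice as long as necessary, so it contributes only 0.5 to SPL while counting fully toward Success Rate.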

Key Results

Overall performance comparison shows proprietary models (GPT-4o, Claude) significantly outperforming open-source models, though absolute performance remains low on average.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 4 tasks	Success Rate (%)	11.1	28.9	+17.8
EB-ALFRED (High-level)	Success Rate (%)	39.3	52.3	+13.0
EB-Manipulation (Low-level)	Success Rate (%)	13.2	20.3	+7.1

Ablation studies reveal the critical importance of vision for low-level tasks compared to high-level tasks.

Benchmark	Metric	Baseline	This Paper	Δ
EB-Navigation	Success Rate (%)	6.7	26.3	+19.6
EB-ALFRED	Success Rate (%)	50.0	52.3	+2.3
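
The ablation rows can be read either as an absolute gap (percentage points) or as a relative drop, and the two differ a lot. The sketch below computes both from the table values, assuming the Baseline column is the text-only run and the other column is the run with vision; the function name `drops` is hypothetical.

```python
from typing import Tuple


def drops(with_vision: float, text_only: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in % of the with-vision score)."""
    absolute = with_vision - text_only
    relative = 100.0 * absolute / with_vision
    return round(absolute, 1), round(relative, 1)


print("EB-Navigation:", drops(26.3, 6.7))
print("EB-ALFRED:", drops(52.3, 50.0))
```

Under that reading, removing vision costs EB-Navigation 19.6 points (about 74.5% of its score), while EB-ALFRED loses 2.3 points (about 4.4%).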

Experiment Figures

Performance overview (Success Rate) of 24 MLLMs across the 4 EmbodiedBench environments.

Ablation study on the impact of vision (Vision vs. Text-Only).

Main Takeaways

MLLMs are strong high-level planners but weak low-level controllers: GPT-4o's success rate falls by about 32 percentage points when moving from EB-ALFRED (high-level, 52.3%) to EB-Manipulation (low-level, 20.3%).
Vision is essential for low-level control: While text-only agents perform comparably on high-level tasks (likely due to language priors), their success rates collapse on navigation and manipulation without visual feedback (e.g., 26.3% to 6.7% on EB-Navigation).
Long-horizon planning is a major bottleneck: Performance degrades significantly on the 'Long Horizon' subset across all models.
Open-source gap: Leading open-source models (Llama-3.2, Qwen2-VL) lag significantly behind GPT-4o and Gemini, particularly in complex spatial reasoning tasks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Multi-modal LLMs (MLLMs)
Basic robotics concepts (kinematics, action spaces, POMDP)
Familiarity with embodied simulators (AI2-THOR, Habitat)

Key Terms

Embodied Agent: An AI system controlling a physical or simulated body (robot) to perform tasks in an environment

Low-level actions: Atomic commands directly executable by robots, such as specific movement distances (meters) or joint rotations (degrees)

High-level actions: Abstract, semantic commands composed of multiple low-level primitives, like 'Find apple' or 'Pick up mug'

POMDP: Partially Observable Markov Decision Process, a mathematical framework for decision-making where the agent cannot directly see the entire state of the world

Ego-centric vision: Visual input captured from the robot's own perspective (first-person view)

Kinematic: Relating to the motion of points, bodies, and systems without considering the forces that cause them

AI2-THOR: A photorealistic interactive environment for embodied AI agents

Habitat: A high-performance 3D simulator for training virtual robots

YOLO: You Only Look Once, a real-time object detection system used here to provide bounding boxes to the agent

Euler angles: Three angles (roll, pitch, yaw) used to describe the orientation of a rigid body