
Embodied AI Agents: Modeling the World

Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hervé Jégou, A. Lazaric, Arjun Majumdar, Andrea Madotto, F. Meier, Florian Metze, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, J. Malik
Meta AI Research
arXiv.org (2025)
Agent MM Memory Reasoning Speech

📝 Paper Summary

Embodied AI World Models Human-AI Interaction
Embodied AI requires transitioning from generative next-token prediction to predictive world models that integrate perception, memory, and planning to effectively reason about and interact with physical environments.
Core Problem
Current generative models (LLMs/VLMs) are inefficient for embodied tasks because they prioritize high-detail creative generation over the reasoning, planning, and physical understanding required for consistent interaction.
Why it matters:
  • Generative models often hallucinate physical actions or constraints, making them unreliable for real-world robotics or wearable assistance
  • Predicting every pixel or token is computationally inefficient compared to predicting abstract representations of future states needed for planning
  • Disembodied web agents lack the ego-centric perception required to assist users with physical tasks like cooking or assembly
Concrete Example: When a wearable agent guides a user through a recipe, a standard VLM might hallucinate a step or lose track of the user's progress because it lacks a persistent world model. The proposed approach would instead maintain a state representation of the physical world and use it to plan the next instruction accurately.
Key Novelty
World Modeling Framework for Embodied Agents
  • Proposes replacing generative next-token prediction with 'World Models' (often based on JEPA architectures) that predict abstract states and action consequences
  • Integrates 'Mental World Models' (understanding user intent/social context) alongside 'Physical World Models' (understanding environment physics)
  • Unifies three distinct agent types (Virtual, Wearable, Robotic) under a single framework relying on multimodal perception and memory
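The idea of predicting abstract states and action consequences, rather than every pixel or token, can be illustrated with a toy planner. This is a minimal sketch and not the paper's implementation: `encode`, `predict`, `cost`, `plan`, and the grid-world `ACTIONS` are hypothetical names, and the hand-written encoder and predictor stand in for networks that a JEPA-style model would learn from data.

```python
from itertools import product

# Hypothetical discrete action set: unit moves on a 2-D grid.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def encode(observation):
    """Stand-in for a learned encoder: keep only the task-relevant part.

    The toy observation is (x, y, *distractors). A real encoder would be a
    trained network mapping pixels or audio to an abstract state.
    """
    return (observation[0], observation[1])


def predict(latent, action):
    """Stand-in for a learned predictor: next abstract state given an action."""
    dx, dy = ACTIONS[action]
    return (latent[0] + dx, latent[1] + dy)


def cost(latent, goal_latent):
    """Manhattan distance between a predicted state and the goal state."""
    return abs(latent[0] - goal_latent[0]) + abs(latent[1] - goal_latent[1])


def plan(observation, goal_observation, horizon=3):
    """Roll out every action sequence in latent space and return the best one."""
    start, goal = encode(observation), encode(goal_observation)
    best_sequence, best_cost = None, float("inf")
    for sequence in product(ACTIONS, repeat=horizon):
        latent = start
        for action in sequence:
            latent = predict(latent, action)  # no observation is ever generated
        final_cost = cost(latent, goal)
        if final_cost < best_cost:
            best_sequence, best_cost = sequence, final_cost
    return list(best_sequence), best_cost


if __name__ == "__main__":
    # Trailing values are distractor "detail" that the encoder discards.
    actions, final_cost = plan((0, 0, 0.3, 0.9), (2, 1, 0.5, 0.1))
    print(actions, final_cost)
```

The planner never reconstructs the distractor values in the observation. It compares predicted and goal states only in the abstract space, which is the efficiency argument made above. Exhaustive search is used here only because the toy action space is tiny; a practical system would need a more scalable search or optimization procedure.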
Evaluation Highlights
  • Released the Seamless Interaction dataset containing over 4,000 hours of dyadic (two-person) interactions for training social agents
  • Developed 'Meta Motivo', a behavioral foundation model that controls physics-based humanoid avatars via zero-shot prompting
  • Established that VLMs outperform LLMs and Diffusion Models on a custom WorldPrediction benchmark for action planning (qualitative result)
Breakthrough Assessment
7/10
A strong position paper/survey from a major lab outlining a strategic shift toward World Models and JEPA. While it introduces significant datasets (Seamless) and models (Motivo), the provided text lacks detailed quantitative benchmarks for the core world modeling claims.