
Magma: A Foundation Model for Multimodal AI Agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, J. Jang, Yuquan Deng, Lars Lidén, Jianfeng Gao
Microsoft
Computer Vision and Pattern Recognition (2025)
MM Agent, Pretraining, Benchmark

📝 Paper Summary

Vision-Language-Action (VLA) models, Multimodal Foundation Models, Embodied AI
Magma is a unified multimodal foundation model that achieves spatial-temporal intelligence for both digital UI navigation and physical robotic manipulation by pretraining on diverse data labeled with Set-of-Mark and Trace-of-Mark.
Core Problem
Existing Vision-Language-Action (VLA) models are typically trained separately for specific domains (2D UI vs. 3D robotics) and often sacrifice generic multimodal understanding for task-specific action policies.
Why it matters:
  • Current approaches require separate models for digital and physical worlds, limiting generalization
  • Simply combining heterogeneous datasets fails due to the gap between verbal understanding (text) and spatial action execution (coordinates/poses)
  • Valuable video data (e.g., human instructional videos) is hard to leverage for agent training because it lacks explicit action labels
Concrete Example: A standard VLA might learn to 'click a button' from 2D screen coordinates yet fail to transfer that understanding to 'picking up a cup' with a robot arm, because the two action spaces (2D vs. 7-DoF) and their visual representations are treated as unrelated tasks.
Key Novelty
Unified Spatial-Temporal Training via Visual Prompting (SoM & ToM)
  • Transforms diverse datasets (images, videos, UI, robotics) into a unified format where actionable objects are overlaid with visual markers (Set-of-Mark)
  • Converts unlabeled videos into action-supervision data by tracking object movement over time (Trace-of-Mark), forcing the model to predict future trajectories as a surrogate for planning
  • Uses a single model to handle verbal tasks (QA), 2D spatial tasks (UI navigation), and 3D physical tasks (robotics) without architectural branching
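The two labeling ideas above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the names `Mark`, `set_of_mark`, and `trace_of_mark` are hypothetical, candidate boxes and per-frame point tracks are assumed to come from external detectors and trackers that are not shown, and the marks are kept as data rather than drawn onto pixels.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


@dataclass
class Mark:
    """Hypothetical record for one visual marker (not from the paper's code)."""
    mark_id: int   # numeric label that would be overlaid on the image
    center: Point  # location where the label would be placed


def set_of_mark(boxes: List[Box]) -> List[Mark]:
    """Set-of-Mark sketch: give each candidate actionable region a numeric mark."""
    marks = []
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        marks.append(Mark(mark_id=i, center=((x0 + x1) / 2, (y0 + y1) / 2)))
    return marks


def trace_of_mark(
    tracks: Dict[int, List[Point]], t: int, horizon: int
) -> Dict[int, List[Point]]:
    """Trace-of-Mark sketch: future positions of each mark after frame t.

    `tracks` maps mark_id -> per-frame positions (assumed to be produced by
    an external point tracker). The returned traces are the supervision
    targets the model would be asked to predict from frame t.
    """
    return {mid: pts[t + 1 : t + 1 + horizon] for mid, pts in tracks.items()}


# Usage: two candidate regions in the first frame, tracked over four frames.
marks = set_of_mark([(0, 0, 10, 20), (30, 40, 50, 60)])
tracks = {
    1: [(5, 10), (6, 10), (7, 11), (8, 12)],  # mark 1 moves
    2: [(40, 50)] * 4,                        # mark 2 stays still
}
targets = trace_of_mark(tracks, t=0, horizon=2)
```

In this framing, choosing a mark identifier plays the role of action grounding in a single image, while predicting the future trace of that mark stands in for action supervision when a video carries no explicit action labels.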
Evaluation Highlights
  • Achieves state-of-the-art (SOTA) results on UI navigation benchmarks (Mind2Web, AITW) and robotic manipulation benchmarks (Bridge, LIBERO), outperforming domain-specific models
  • Attains SOTA on the BLINK benchmark without instruction fine-tuning, demonstrating strong zero-shot spatial grounding
  • Maintains competitive performance on standard vision-language benchmarks (GQA, VideoMME) compared to much larger LMMs, indicating that it retains verbal intelligence
Breakthrough Assessment
9/10
Successfully unifies UI agents and robotic agents into a single foundation model while improving performance on both. The Trace-of-Mark technique effectively unlocks video data for action pretraining.