
GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Théophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi
Georgia Institute of Technology, Carnegie Mellon University, University of Illinois Urbana-Champaign, Mistral AI, University of Washington
Computer Vision and Pattern Recognition (2024)
MM Benchmark Agent Memory RL

📝 Paper Summary

Embodied AI, Visual Navigation, Lifelong Learning
GOAT-Bench evaluates embodied agents on navigating to a sequence of open-vocabulary targets within a single environment, where each target is specified as an image, a language description, or an object category. The setup tests lifelong learning and memory.
Core Problem
Existing navigation benchmarks are typically single-modality (only objects or only points) and episodic (the environment is reset after each goal), so they do not test an agent's ability to handle diverse goal inputs or to reuse spatial memory over time.
Why it matters:
  • Real-world robots must handle diverse user commands (e.g., 'find the oven' vs. showing a photo of a specific toy) without switching models
  • Resetting memory after every task is inefficient; persistent memory allows robots to navigate faster to previously visited areas
  • Current methods tailored to single modalities (like ObjectNav) fail to generalize to instance-specific image or language goals
Concrete Example: An agent is asked to find a 'recliner chair' (category goal). After finding it, it is shown an image of a specific oven (image goal). Finally, it is told to find 'the white book on the coffee table' (language goal). Current agents treat these as isolated tasks and forget where the coffee table is, even though they passed it while searching for the chair.
Key Novelty
Multi-Modal Lifelong Navigation Benchmark (GOAT-Bench)
  • Introduces a lifelong episode structure where agents must solve 5-10 sequential subtasks in the same scene without resetting, incentivizing memory usage
  • Integrates three distinct goal modalities (Category, Language Description, Image) into a single evaluation protocol using open-vocabulary targets
  • Benchmarks both modular (map-based) and monolithic (end-to-end RL) approaches to analyze trade-offs in efficiency and robustness
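The episode structure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the benchmark's actual API: the names `GoalModality`, `Subtask`, `LifelongEpisode`, `AgentMemory`, and `run_episode` are hypothetical, and the "memory" is a plain set of labels standing in for a persistent map.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set


class GoalModality(Enum):
    """The three goal types named in the summary."""
    CATEGORY = "category"
    LANGUAGE = "language"
    IMAGE = "image"


@dataclass
class Subtask:
    modality: GoalModality
    goal: str  # hypothetical encoding: a label identifying the target


@dataclass
class LifelongEpisode:
    scene_id: str
    subtasks: List[Subtask]

    def __post_init__(self) -> None:
        # The summary states 5-10 sequential subtasks per episode.
        if not 5 <= len(self.subtasks) <= 10:
            raise ValueError("expected 5-10 subtasks per episode")


@dataclass
class AgentMemory:
    # Hypothetical stand-in for a persistent map: labels seen so far.
    seen: Set[str] = field(default_factory=set)


def run_episode(
    episode: LifelongEpisode,
    observe: Callable[[Subtask], Set[str]],
) -> List[bool]:
    """Run all subtasks with one shared memory (no reset between goals).

    Returns, per subtask, whether the goal was already in memory
    before the search for it started.
    """
    memory = AgentMemory()  # created once per episode, never reset
    already_known = []
    for subtask in episode.subtasks:
        already_known.append(subtask.goal in memory.seen)
        # `observe` is a caller-supplied stub returning labels seen en route.
        memory.seen.update(observe(subtask))
    return already_known
```

An episodic benchmark would correspond to re-creating `AgentMemory` inside the loop, in which case no goal could ever be known ahead of its own search.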
Evaluation Highlights
  • Modular methods with explicit memory achieve ~1.5x efficiency (SPL) improvement in later subtasks compared to the start of an episode, validating lifelong learning benefits
  • Removing memory from Modular GOAT drops efficiency (SPL) by about 47% (17.6 to 9.4), highlighting the critical role of persistent mapping
  • End-to-end RL policies (SenseAct-NN) are more robust to noisy goal specifications (e.g., synonyms) than modular methods but have lower efficiency (SPL) because they lack effective mapping
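The efficiency figures above are reported in SPL (Success weighted by Path Length). The sketch below assumes the standard definition, in which each successful subtask contributes the ratio of shortest-path length to the longer of the shortest and actual path lengths, and failures contribute zero. The function names `spl` and `relative_drop` are illustrative and not taken from the benchmark's code.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest: Sequence[float],
    taken: Sequence[float],
) -> float:
    """Mean over subtasks of success * shortest / max(taken, shortest)."""
    if not successes or not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ok, l, p in zip(successes, shortest, taken):
        if l <= 0:
            raise ValueError("shortest-path length must be positive")
        if ok:
            # A path can never score above 1, even if p < l due to noise.
            total += l / max(p, l)
    return total / len(successes)


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# The memory ablation quoted above: SPL falls from 17.6 to 9.4.
print(f"{relative_drop(17.6, 9.4):.1%}")  # prints 46.6%
```

Because SPL weights success by path efficiency, an agent that reaches every goal but wanders on the way still scores low, which is why persistent maps that shorten later searches show up directly in this metric.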
Breakthrough Assessment
8/10
Significantly advances the field by unifying disparate navigation tasks (Object, Image, Language) into a realistic lifelong setting. The benchmark exposes severe limitations in how current state-of-the-art methods use memory and integrate multiple goal modalities.