GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

📝 Paper Summary

Embodied AI, Visual Navigation, Lifelong Learning

GOAT-Bench evaluates embodied agents on navigating to sequences of open-vocabulary targets specified via images, language, or categories within the same environment to test lifelong learning and memory capabilities.

Core Problem

Existing navigation benchmarks are typically single-modality (only objects or only points) and episodic (resetting the environment after each goal), failing to test an agent's ability to handle diverse inputs or leverage spatial memory over time.

Why it matters:

Real-world robots must handle diverse user commands (e.g., 'find the oven' vs. showing a photo of a specific toy) without switching models
Resetting memory after every task is inefficient; persistent memory allows robots to navigate faster to previously visited areas
Current methods tailored to single modalities (like ObjectNav) fail to generalize to instance-specific image or language goals

Concrete Example: An agent is asked to find a 'recliner chair' (category goal). After finding it, it is shown an image of a specific oven (image goal). Finally, it is told to find 'the white book on the coffee table' (language goal). Current agents treat these as isolated tasks, forgetting the coffee table's location seen while searching for the chair.

Key Novelty

Multi-Modal Lifelong Navigation Benchmark (GOAT-Bench)

Introduces a lifelong episode structure where agents must solve 5-10 sequential subtasks in the same scene without resetting, incentivizing memory usage
Integrates three distinct goal modalities (Category, Language Description, Image) into a single evaluation protocol using open-vocabulary targets
Benchmarks both modular (map-based) and monolithic (end-to-end RL) approaches to analyze trade-offs in efficiency and robustness

Evaluation Highlights

Modular methods with explicit memory achieve ~1.5x efficiency (SPL) improvement in later subtasks compared to the start of an episode, validating lifelong learning benefits
Removing memory from Modular GOAT drops efficiency (SPL) by roughly 47% (17.6 to 9.4), highlighting the critical role of persistent mapping
End-to-end RL policies (SenseAct-NN) are more robust to noise (e.g., synonyms) than modular methods but suffer from poor efficiency (SPL) due to lack of effective mapping

Breakthrough Assessment

8/10

Significantly advances the field by unifying disparate navigation tasks (Object, Image, Language) into a realistic lifelong setting. The benchmark exposes severe limitations in current SOTA memory and multi-modal integration.

⚙️ Technical Details

Problem Definition

Setting: Lifelong navigation in 3D indoor environments (HM3D scenes) with sequential subtasks

Inputs: RGB-D images, GPS+Compass sensor, and a Goal g_k (Image, Category, or Language Description) for the k-th subtask
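
A minimal sketch of how one lifelong episode and its per-subtask goals g_k could be represented. The class and field names (`Goal`, `LifelongEpisode`, `modality`, `payload`, `scene_id`) and the example values are hypothetical and chosen for illustration; this is not the benchmark's actual dataset schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names for illustration; not the GOAT-Bench dataset schema.
MODALITIES = ("category", "language", "image")


@dataclass
class Goal:
    modality: str  # one of MODALITIES
    payload: str   # category name, language description, or image path

    def __post_init__(self) -> None:
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown goal modality: {self.modality}")


@dataclass
class LifelongEpisode:
    scene_id: str
    goals: List[Goal] = field(default_factory=list)

    def validate(self) -> None:
        # The summary states that each episode chains 5-10 subtasks.
        if not 5 <= len(self.goals) <= 10:
            raise ValueError("an episode should contain 5 to 10 subtasks")


# Made-up example episode: goals of different modalities in one scene.
episode = LifelongEpisode(
    scene_id="example_scene",
    goals=[
        Goal("category", "recliner chair"),
        Goal("image", "oven_view.png"),
        Goal("language", "the white book on the coffee table"),
        Goal("category", "bed"),
        Goal("language", "the lamp next to the sofa"),
    ],
)
episode.validate()
```

Keeping all goals in one ordered list per scene mirrors the key property of the setting: the agent is not reset between subtasks, so anything it observed while solving goal k remains available for goal k+1.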

Outputs: Discrete actions: move_forward (0.25m), turn_left (30 deg), turn_right (30 deg), look_up, look_down, stop
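
The discrete action space can be sketched as an enum plus a simple planar pose update. The step size (0.25 m) and turn angle (30 degrees) come from the action list above; the names `Action` and `step_pose` and the heading convention (0 degrees along +x, left turns counter-clockwise) are assumptions for illustration and are not the Habitat API.

```python
import math
from enum import Enum


class Action(Enum):
    MOVE_FORWARD = "move_forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    LOOK_UP = "look_up"
    LOOK_DOWN = "look_down"
    STOP = "stop"


FORWARD_STEP_M = 0.25  # from the action list above
TURN_ANGLE_DEG = 30.0  # from the action list above


def step_pose(x, y, heading_deg, action):
    """Return the new planar pose (x, y, heading_deg) after one action.

    Assumed convention: heading 0 points along +x and left turns are
    counter-clockwise. look_up, look_down, and stop leave the pose unchanged.
    """
    if action is Action.MOVE_FORWARD:
        rad = math.radians(heading_deg)
        return (x + FORWARD_STEP_M * math.cos(rad),
                y + FORWARD_STEP_M * math.sin(rad),
                heading_deg)
    if action is Action.TURN_LEFT:
        return (x, y, (heading_deg + TURN_ANGLE_DEG) % 360.0)
    if action is Action.TURN_RIGHT:
        return (x, y, (heading_deg - TURN_ANGLE_DEG) % 360.0)
    return (x, y, heading_deg)
```

This ignores collisions and camera tilt; it only illustrates the granularity of motion the agent has to plan with.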

Pipeline Flow

Group: Perception -> Mapping -> Planning (Modular Method)
Group: Perception -> Embedding -> Policy (SenseAct-NN Method)

System Modules

Goal Encoder

Encodes the multi-modal goal (Text/Image) into a vector

Model or implementation: CLIP (Text/Image) or CroCo-v2 (Image)

Semantic Mapper (Mapping, Modular only)

Projects RGB-D observations into a top-down semantic map

Model or implementation: Deterministic projection with DETIC or ground-truth semantics

Instance Memory (Mapping, Modular only)

Clusters semantic pixels into object instances and stores their features

Model or implementation: Clustering algorithm + CLIP feature extractor

Global Planner

Selects a long-term goal on the map based on the current subtask

Model or implementation: Heuristic exploration or Goal Matching (Cosine similarity)
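
A minimal sketch of the planner's choice between goal matching and exploration. It assumes the memory is a dictionary from instance id to a single feature vector, and that the planner falls back to exploration when no stored instance is similar enough. The function names and the `threshold` value are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_long_term_goal(goal_embedding, instance_memory, threshold=0.8):
    """Return ("navigate", instance_id) or ("explore", None).

    instance_memory maps instance ids to feature vectors. The threshold is
    a hypothetical value chosen for this sketch.
    """
    best_id, best_score = None, -1.0
    for instance_id, feature in instance_memory.items():
        score = cosine_similarity(goal_embedding, feature)
        if score > best_score:
            best_id, best_score = instance_id, score
    if best_id is not None and best_score >= threshold:
        return ("navigate", best_id)
    # Nothing in memory matches well enough: keep exploring.
    return ("explore", None)
```

For example, with `memory = {"oven_1": [1.0, 0.0], "chair_2": [0.0, 1.0]}`, the query `[0.9, 0.1]` is matched to `"oven_1"`, while the ambiguous query `[1.0, 1.0]` falls below the threshold for both instances and triggers exploration.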

RNN Policy

Maps observations and goal embeddings directly to actions

Model or implementation: 2-layer GRU (512-d)

Novel Architectural Elements

Instance-Specific Memory (Modular GOAT): Augments semantic maps with instance-level clustering and CLIP embeddings to handle specific object queries (Image/Lang) rather than just categories
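
A rough sketch of an instance-level memory. The summary only says that a clustering algorithm groups semantic pixels into instances; the greedy distance-threshold merge below is a stand-in chosen for brevity, not the paper's method. `Instance`, `InstanceMemory`, `add_detection`, and `merge_radius_m` are hypothetical names, and `feature` stands for a per-view embedding such as a CLIP vector.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instance:
    category: str
    center: Tuple[float, float]  # running mean map position in meters
    features: List[List[float]] = field(default_factory=list)  # one per view
    count: int = 0


class InstanceMemory:
    def __init__(self, merge_radius_m: float = 1.0):
        self.merge_radius_m = merge_radius_m  # hypothetical merge distance
        self.instances: List[Instance] = []

    def add_detection(self, category, position, feature) -> int:
        """Merge into a nearby same-category instance or create a new one.

        Returns the index of the instance that received the detection.
        """
        for idx, inst in enumerate(self.instances):
            close = math.dist(inst.center, position) <= self.merge_radius_m
            if inst.category == category and close:
                n = inst.count
                inst.center = (
                    (inst.center[0] * n + position[0]) / (n + 1),
                    (inst.center[1] * n + position[1]) / (n + 1),
                )
                inst.features.append(feature)
                inst.count = n + 1
                return idx
        self.instances.append(Instance(category, tuple(position), [feature], 1))
        return len(self.instances) - 1
```

Storing one feature per view, rather than one per category, is what lets a later image or language goal be compared against a specific object instead of the whole class.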

Modeling

Base Model: CLIP-ResNet50 (Visual Encoder)

Training Method: Reinforcement Learning (VER - Variable Experience Rollout) / Modular Integration

Objective Functions:

Purpose: Maximize navigation success and efficiency.

Formally: Standard RL Reward (distance reduction + success bonus) for SenseAct-NN
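
A sketch of a reward with the two terms named above: distance reduction and a success bonus. The summary does not give coefficients, so `success_bonus` and `success_radius_m` are placeholder values, and `step_reward` is a hypothetical name.

```python
def step_reward(prev_dist_m, curr_dist_m, called_stop,
                success_radius_m=1.0, success_bonus=2.5):
    """Per-step reward: progress toward the goal plus a bonus on success.

    prev_dist_m / curr_dist_m are distances to the goal before and after
    the action. The default radius and bonus are placeholders, not values
    reported in the paper.
    """
    reward = prev_dist_m - curr_dist_m  # positive when the agent gets closer
    if called_stop and curr_dist_m <= success_radius_m:
        reward += success_bonus
    return reward
```

The distance term gives a dense learning signal at every step, while the bonus is paid only when the agent calls stop close enough to the goal.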

Training Data:

725k training episodes
145 training scenes (HM3DSem)
264 training object categories

Key Hyperparameters:

training_steps: 500 million steps (for SenseAct-NN Monolithic)
hidden_size: 512 (GRU)
episode_length: 5-10 subtasks

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoW: GOAT uses instance-specific memory clustering rather than just semantic mapping, allowing better differentiation of specific object instances
vs. SenseAct-NN: Modular GOAT uses explicit mapping for long-term memory, whereas SenseAct-NN relies on RNN hidden states, which the paper shows are less effective for efficiency (SPL)
vs. OVON [not cited in paper]: GOAT extends the open-vocabulary concept to multi-modal inputs (Image/Lang) and lifelong sequences, not just single-category episodes

Limitations

Modular methods are sensitive to noise in object detection and synonym variations
CLIP features capture instance-specific details poorly for both language and image goals
SenseAct-NN methods struggle to build implicit maps, resulting in low efficiency (SPL)
Performance on language goals is significantly lower than on object (category) goals across all methods

Reproducibility

Code: https://mukulkhanna.github.io/goat-bench

Code and dataset generation scripts are publicly available at mukulkhanna.github.io/goat-bench. The benchmark uses the Habitat simulator and the HM3D dataset. Pre-trained weights for components (CLIP, DETIC, CroCo) are standard public artifacts.

📊 Experiments & Results

Evaluation Setup

Simulation in Habitat with HM3D scenes. Agents run sequences of 5-10 subtasks.

Benchmarks:

GOAT-Bench (Multi-Modal Lifelong Navigation) [New]

Metrics:

Success Rate (SR)
Success weighted by Path Length (SPL)
Statistical methodology: Not explicitly reported in the paper
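
For reference, SPL follows the standard definition: the mean over runs of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length, and p_i is the length of the path the agent took. The sketch below computes it from a list of tuples; the function name and input layout are assumptions for illustration, and the benchmark's own evaluation code may differ in details.

```python
def spl(runs):
    """Success weighted by Path Length.

    runs: iterable of (success, shortest_path_m, agent_path_m) tuples,
    one per evaluated run (for example, one per subtask).
    """
    total = 0.0
    count = 0
    for success, shortest, taken in runs:
        count += 1
        if not success:
            continue  # failed runs contribute 0
        denom = max(taken, shortest)
        if denom > 0:  # guard against zero-length paths (an assumption)
            total += shortest / denom
    return total / count if count else 0.0
```

For example, `spl([(True, 5.0, 10.0), (True, 4.0, 4.0), (False, 3.0, 6.0)])` gives (0.5 + 1.0 + 0.0) / 3 = 0.5: a successful but twice-too-long path earns half credit, an optimal path earns full credit, and a failure earns none.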

Key Results

Comparison of Modular GOAT vs. RL-based SenseAct-NN Skill Chain on Val Unseen (generalization) and Val Seen datasets.
Benchmark	Metric	Baseline	This Paper	Δ
GOAT-Bench (Val Unseen)	Success Rate (SR)	15.2	18.5	+3.3
GOAT-Bench (Val Unseen)	SPL	9.4	10.1	+0.7
GOAT-Bench (Val Seen)	SPL	8.71	14.8	+6.09

Ablation study demonstrating the impact of memory on efficiency (SPL).
Benchmark	Metric	Baseline	This Paper	Δ
GOAT-Bench (Val Seen)	SPL	9.4	17.6	+8.2
GOAT-Bench (Val Seen)	SPL	9.0	9.4	+0.4

Experiment Figures

Performance (Success and SPL) as a function of the number of subtasks completed in an episode.

Robustness of methods to noise (Gaussian noise on images, synonyms for categories, paraphrased language).

Main Takeaways

Modular methods excel at efficiency (SPL) when given explicit memory, improving significantly over the course of an episode as the map is built
End-to-end RL (SenseAct-NN) generalizes better to unseen object categories (higher Success Rate) and is more robust to input noise, but is highly inefficient
Current CLIP representations are insufficient for fine-grained instance identification (Language/Image goals), causing low performance on non-category goals
Instance-specific memory is crucial: Modular GOAT outperforms CLIP-on-Wheels (CoW) by maintaining instance clusters rather than relying on raw feature matching

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/VER)
Semantic Mapping
CLIP embeddings
Visual Navigation tasks (ObjectNav, ImageNav)

Key Terms

SPL: Success weighted by Path Length—a metric measuring navigation efficiency by balancing success with how close the path was to the optimal shortest path

ObjectNav: Object-Goal Navigation—navigating to an instance of a specific object category (e.g., 'chair')

ImageNav: Image-Goal Navigation—navigating to an object matching a specific query image

LanguageNav: Language-Goal Navigation—navigating to an object described by a natural language string

Lifelong Navigation: A setting where the agent performs multiple tasks sequentially in the same environment without memory resets

Open-vocabulary: The ability to recognize and navigate to object categories not explicitly seen during training

SenseAct-NN: Sensors-to-Action Neural Network—an end-to-end policy mapping raw sensors directly to actions

CLIP: Contrastive Language-Image Pre-training—a model used to embed text and images into a shared semantic space

VER: Variable Experience Rollout—a distributed reinforcement learning technique for efficient training

HM3D: Habitat-Matterport 3D—a dataset of high-fidelity 3D reconstructions of indoor spaces