MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation

📝 Paper Summary

Vision-and-Language Navigation (VLN) · Embodied Agents · Zero-shot Navigation

MapGPT enables zero-shot vision-and-language navigation by constructing an online topological map converted into text prompts, allowing GPT-4 to perform global adaptive path planning.

Core Problem

Existing zero-shot VLN agents rely solely on local observations and lack a global memory or map, causing them to wander aimlessly when they make mistakes or need to backtrack.

Why it matters:

Without a global view, agents cannot correct erroneous exploration or perform strategic backtracking, leading to navigation failure
Current methods rely on complex multi-expert systems to summarize history, which is inefficient and leads to information loss compared to a unified global map approach

Concrete Example: When an agent relying on local action spaces realizes it has explored a wrong path, it can only continue to explore the immediate surroundings randomly because it does not remember the structure of previously visited nodes to backtrack effectively.

Key Novelty

Map-Guided Prompting with Adaptive Planning

Converts an online topological graph (nodes and edges) into a linguistic text format (e.g., 'Place A is connected with Place B') that LLMs can understand directly without GPS coordinates
Introduces an iterative planning mechanism where the agent outputs a multi-step 'Plan' at each step, updating it dynamically based on the map to support backtracking and systematic exploration
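The map-to-text conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `map_to_text`, the adjacency-dictionary representation, and the exact sentence template are assumptions modeled on the quoted example 'Place A is connected with Place B'.

```python
from typing import Dict, List, Set


def map_to_text(adjacency: Dict[str, Set[str]]) -> str:
    """Render a topological graph as one sentence per place (hypothetical format)."""
    lines: List[str] = []
    for place in sorted(adjacency):
        neighbors = sorted(adjacency[place])
        if neighbors:  # skip isolated nodes, which carry no connectivity information
            joined = ", ".join(f"Place {n}" for n in neighbors)
            lines.append(f"Place {place} is connected with {joined}")
    return "\n".join(lines)


# Toy map: A links to B and C; no coordinates are involved, only connectivity.
example_map = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(map_to_text(example_map))
```

Because only connectivity is serialized, the prompt grows with the number of explored places and edges rather than with any metric detail of the scene.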

Architecture

The complete MapGPT pipeline showing how task descriptions, fundamental inputs (History, Observation, Action Space), and the Map are processed by the Prompt Manager and fed into the LLM.

Evaluation Highlights

Achieves 31.6% Success Rate (SR) on the REVERIE benchmark, surpassing some supervised learning-based methods
Reduces token consumption significantly: ~672 input tokens per step compared to NavGPT's 2,465 tokens, due to a streamlined single-expert prompt design
Reported ~10% and ~12% improvements in Success Rate (SR) on R2R and REVERIE datasets respectively compared to state-of-the-art zero-shot agents

Breakthrough Assessment

8/10

Strongly addresses the 'memory' and 'global view' deficit in zero-shot LLM agents by successfully encoding topological maps into text prompts, achieving SOTA zero-shot results.

⚙️ Technical Details

Problem Definition

Setting: Vision-and-Language Navigation (VLN) where an agent navigates indoor environments to find a target based on linguistic instructions and visual observations

Inputs: Natural language instruction I, History H_t, Visual Observation O_t (images of viewpoints), Action Space A_t

Outputs: Action selection a_t (move to node or stop) and Planning P_t

Pipeline Flow

Visual Perception (Image Capture)
Map Update (Online Topological Graph)
Prompt Manager (Text Generation)
LLM Inference (Planning & Action)
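The four stages can be read as one loop per navigation step. The sketch below is an assumption-laden illustration rather than the paper's code: `navigate`, `observe`, and `llm` are hypothetical names, the prompt layout is invented for the example, and the LLM is replaced by a scripted stand-in so the loop runs without any API access.

```python
from typing import Callable, Dict, List, Set


def navigate(instruction: str,
             observe: Callable[[str], Dict[str, str]],
             llm: Callable[[str], str],
             start: str,
             max_steps: int = 10) -> List[str]:
    """Run a perceive -> map update -> prompt -> LLM loop (hypothetical sketch)."""
    graph: Dict[str, Set[str]] = {}
    path: List[str] = [start]
    for _ in range(max_steps):
        current = path[-1]
        # 1. Visual perception: navigable neighbors with a text description each.
        candidates = observe(current)
        # 2. Map update: record undirected edges from the current place.
        graph.setdefault(current, set())
        for neighbor in candidates:
            graph[current].add(neighbor)
            graph.setdefault(neighbor, set()).add(current)
        # 3. Prompt manager: serialize history, observation, map, and actions.
        map_lines = [
            f"Place {place} is connected with "
            + ", ".join(f"Place {n}" for n in sorted(neighbors))
            for place, neighbors in sorted(graph.items()) if neighbors
        ]
        prompt = "\n".join([
            f"Instruction: {instruction}",
            "History: " + " -> ".join(path),
            "Observation: " + "; ".join(
                f"Place {n}: {d}" for n, d in sorted(candidates.items())),
            "Map:",
            *map_lines,
            "Action options: " + ", ".join(sorted(candidates)) + ", stop",
        ])
        # 4. LLM inference: the reply is either a neighboring place or "stop".
        action = llm(prompt).strip()
        if action not in candidates:
            break  # "stop" or an invalid choice ends the episode
        path.append(action)
    return path


# Toy environment: each place maps to its navigable neighbors and their captions.
toy_env = {
    "A": {"B": "a hallway"},
    "B": {"A": "a bedroom", "C": "a kitchen"},
    "C": {"B": "a hallway"},
}
replies = iter(["B", "C", "stop"])  # scripted stand-in for GPT-4 replies
print(navigate("Walk to the kitchen.", toy_env.__getitem__,
               lambda prompt: next(replies), "A"))
```

Since edges are stored in both directions, previously visited places reappear among the candidates, which is what makes backtracking expressible as an ordinary action in this simplified loop.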

System Modules

Visual Perception

Capture observations of navigable points

Model or implementation: BLIP-2 and Faster R-CNN to produce text descriptions (for text-only GPT-4), or raw images (for GPT-4V)

Map Constructor

Maintain and update an online topological graph of the environment

Model or implementation: Graph update algorithm (DUET-style)
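One way to picture the online graph update is a small structure that grows as the agent moves. The class below is a sketch under stated assumptions: the name `TopoMap`, its fields, and the split between visited places and observed-but-unvisited places are illustrative choices, not details confirmed by this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class TopoMap:
    """Online topological map: nodes are places, edges are navigability (hypothetical)."""
    edges: Dict[str, Set[str]] = field(default_factory=dict)
    visited: Set[str] = field(default_factory=set)

    def update(self, current: str, navigable: Iterable[str]) -> None:
        """Mark the current place as visited and link it to every navigable neighbor."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for node in navigable:
            self.edges[current].add(node)
            self.edges.setdefault(node, set()).add(current)

    def frontier(self) -> Set[str]:
        """Places that have been observed but not yet visited."""
        return set(self.edges) - self.visited


topo = TopoMap()
topo.update("A", ["B", "C"])   # step 1: standing at A, seeing B and C
topo.update("B", ["A", "D"])   # step 2: moved to B, seeing A and D
print(sorted(topo.frontier()))
```

Keeping the frontier explicit is what lets a planner name unexplored alternatives when the current branch turns out to be wrong.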

Prompt Manager

Convert the topological graph, history, and observations into structured text prompts

Model or implementation: Rule-based formatting

Navigation Expert

Generate thought, multi-step plan, and select next action

Model or implementation: GPT-4 or GPT-4V

Novel Architectural Elements

Linguistic-formed topological map injection: translating graph connectivity directly into prompt text (e.g., 'Place X is connected with Place Y')
Iterative Adaptive Planning loop: requires the model to output and update a text-based 'Plan' at every step alongside the action
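The plan-carrying loop can be illustrated by how a reply might be parsed and fed forward. Everything specific here is an assumption: the JSON reply format, the keys `thought`, `plan`, and `action`, and the helper name `parse_step` are invented for this sketch, and the actual prompt and output schema may differ.

```python
import json
from typing import Tuple


def parse_step(reply: str, previous_plan: str) -> Tuple[str, str]:
    """Extract (action, plan) from a JSON reply; keep the old plan if none is returned."""
    data = json.loads(reply)
    plan = data.get("plan") or previous_plan
    return data["action"], plan


# The plan produced at step t is placed back into the prompt at step t + 1.
plan = "Navigation has just started; no plan yet."
reply = json.dumps({
    "thought": "The stairs are probably past the hallway.",
    "plan": "Go to Place B; if there are no stairs, return to Place A and try Place C.",
    "action": "B",
})
action, plan = parse_step(reply, plan)
next_prompt_fragment = f"Previous plan: {plan}"
print(action)
print(next_prompt_fragment)
```

Carrying the plan as text means a fallback such as "return to Place A" survives across steps even though the model itself is stateless between calls.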

Modeling

Base Model: GPT-4 (text-only) or GPT-4V (multimodal)

Compute: ~672 input tokens per step (MapGPT) vs. 2,465 (NavGPT)

Comparison to Prior Work

vs. NavGPT: MapGPT uses a global topological map instead of just local views; uses a single expert instead of three; significantly lower token cost
vs. DiscussNav: MapGPT supports REVERIE (high-level instructions) in addition to R2R; focuses on map-guided planning rather than consensus/discussion
vs. DUET: MapGPT is zero-shot (no training) while DUET requires large-scale training data

Limitations

Bounded by the LLM's context window, although the prompt is considerably shorter than NavGPT's
Performance still lags behind fully supervised state-of-the-art methods (implied: the paper claims to surpass only some supervised methods)
Map understanding is linguistic, which may be less precise than metric embeddings for fine-grained movements

Reproducibility

Code: https://chen-judge.github.io/MapGPT/

The code is publicly available at the project page listed above. The method uses only off-the-shelf models (GPT-4, BLIP-2, Faster R-CNN) and requires no training.

📊 Experiments & Results

Evaluation Setup

Zero-shot navigation in indoor environments using photorealistic simulators

Benchmarks:

R2R (vision-and-language navigation with fine-grained instructions)
REVERIE (vision-and-language navigation with high-level instructions and object finding)

Metrics:

Success Rate (SR)
Success weighted by Path Length (SPL)
Statistical methodology: Not explicitly reported in the paper
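For readers unfamiliar with the two metrics, the sketch below computes them with the standard definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the length actually travelled. The function names and the toy episode data are illustrative, and the code assumes every shortest-path length is positive.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for ok, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(travelled, shortest)
    return total / len(successes)


# Three toy episodes: an optimal success, a failure, and a success with a detour.
ok = [True, False, True]
shortest = [10.0, 8.0, 5.0]
travelled = [10.0, 12.0, 10.0]
print(round(success_rate(ok), 3))          # 0.667
print(spl(ok, shortest, travelled))        # 0.5
```

The third episode shows why SPL matters for an agent that backtracks: it still counts as a success for SR but contributes only 0.5 to SPL because the path was twice the optimal length.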

Key Results

Benchmark	Metric	Baseline (NavGPT)	This Paper	Δ
R2R	Average Input Tokens per Step	2465	672	-1793
R2R	Average Output Tokens per Step	317	115	-202
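The table's deltas translate into ratios with a little arithmetic. The short script below only re-derives figures from the four token counts in the table; the variable names are illustrative.

```python
# Average tokens per step, taken from the table above.
baseline_in, mapgpt_in = 2465, 672
baseline_out, mapgpt_out = 317, 115

input_ratio = baseline_in / mapgpt_in      # how many times more input the baseline uses
output_ratio = baseline_out / mapgpt_out   # same for output tokens
input_saving = 1 - mapgpt_in / baseline_in # fraction of input tokens saved

print(baseline_in - mapgpt_in)             # 1793
print(baseline_out - mapgpt_out)           # 202
print(f"{input_ratio:.2f}x")               # 3.67x
print(f"{output_ratio:.2f}x")              # 2.76x
print(f"{input_saving:.1%}")               # 72.7%
```

In other words, the baseline consumes roughly 3.7 times the input tokens and 2.8 times the output tokens per step.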

Experiment Figures

Conceptual comparison between Local-View Agents (NavGPT) and Global-View Agents (MapGPT).

Main Takeaways

MapGPT achieves state-of-the-art zero-shot performance, reporting ~10% SR improvement on R2R and ~12% SR improvement on REVERIE over previous zero-shot agents.
In REVERIE, MapGPT achieves 31.6% SR, which is competitive even against some methods trained specifically on the dataset.
The single-expert design is far more efficient than multi-expert designs like NavGPT (672 vs. 2,465 input tokens per step, roughly 3.7x fewer) while achieving better performance.
The adaptive planning mechanism allows the agent to backtrack and explore systematically, overcoming the 'local minimum' trap of previous zero-shot agents.

📚 Prerequisite Knowledge

Prerequisites

Vision-and-Language Navigation (VLN) concepts
Topological Maps (Graph-based navigation)
Large Language Models (LLMs) prompting

Key Terms

VLN: Vision-and-Language Navigation—a task where agents navigate real-world 3D environments following natural language instructions

R2R: Room-to-Room—a VLN dataset containing detailed, step-by-step navigation instructions

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments—a VLN dataset with high-level instructions referring to remote objects

Topological Map: A map representation using a graph of nodes (locations) and edges (connectivity) rather than precise metric coordinates

Zero-shot: The ability of the model to perform the task without any specific training or fine-tuning on the target dataset

SR: Success Rate—the percentage of navigation episodes where the agent successfully stops at the target location

GPT-4V: GPT-4 with Vision—a multimodal large language model capable of processing both text and images