
CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma
Foshan Graduate School of Innovation at Northeastern University, School of Aeronautic Science and Engineering at Beihang University
arXiv (2026)
Memory MM Agent RAG Reasoning

📝 Paper Summary

Memory recall, Memory organization, Self-evolving, Agentic reasoning
CMMR-VLN enables navigation agents to continually improve by retrieving past multimodal experiences to guide current decisions and selectively updating memory with successful routes or key failure reflections.
Core Problem
LLM-based navigation agents cannot recall and use relevant prior experience, so in long-horizon tasks they often choose randomly at ambiguous forks or repeat past mistakes.
Why it matters:
  • Current agents fail to adapt to unfamiliar environments over time, unlike humans who become experts through accumulated experience
  • Without structured memory, LLMs struggle to ground their broad general knowledge in specific spatial contexts, causing inconsistent decision-making
  • Purely reactive LLM agents often lack the structured logic required to maintain coherence across long navigation trajectories
Concrete Example: When instructed to 'turn left again and wait near the couch,' an agent might see two similar rooms with couches (Place 5 and Place 6). Without memory, it guesses randomly. CMMR-VLN recalls a prior failure at Place 5 and explicitly reasons to choose Place 6 to avoid repeating the mistake.
Key Novelty
Continual Multimodal Memory Retrieval (CMMR)
  • Constructs a memory bank of panoramic images and text landmarks, indexed by CLIP embeddings, allowing the agent to retrieve 'rules' derived from past similar situations
  • Implements a reflection mechanism that updates memory differently for success (storing full paths) versus failure (storing only the specific decision point and error type of the first mistake)
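The two mechanisms above (similarity-based retrieval and asymmetric memory updates) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and field names (`MemoryBank`, `MemoryEntry`, `first_error_step`, `error_type`) are hypothetical, the short lists of floats stand in for CLIP embeddings that a real system would compute from panoramas and landmark text, and keying a successful path by the embedding of its first observation is an assumption made only to keep the example small.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryEntry:
    embedding: list  # placeholder for a CLIP embedding of the stored observation
    kind: str        # "success" or "failure"
    payload: dict    # full path on success; decision point and error type on failure


@dataclass
class MemoryBank:
    entries: list = field(default_factory=list)

    def retrieve(self, query, k=1):
        """Return the k entries most similar to the query embedding."""
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query, e.embedding),
                        reverse=True)
        return ranked[:k]

    def update(self, embeddings, path, succeeded,
               first_error_step=None, error_type=None):
        """Store the whole path on success, only the first mistake on failure."""
        if succeeded:
            # Assumption: index the successful route by its first observation.
            self.entries.append(
                MemoryEntry(embeddings[0], "success", {"path": list(path)}))
        else:
            i = first_error_step
            self.entries.append(
                MemoryEntry(embeddings[i], "failure",
                            {"decision_point": path[i], "error_type": error_type}))


# Toy episodes: one failure whose first mistake was entering Place 5,
# and one success that went through Place 6.
bank = MemoryBank()
bank.update([[1.0, 0.0], [0.0, 1.0]], ["Place 1", "Place 5"],
            succeeded=False, first_error_step=1, error_type="wrong room")
bank.update([[1.0, 0.1], [0.6, 0.4]], ["Place 1", "Place 6"], succeeded=True)

# A new observation that resembles the earlier failure point.
hit = bank.retrieve([0.1, 0.9], k=1)[0]
print(hit.kind, hit.payload)
```

In this toy run the query is closest to the stored failure, so the retrieved entry carries the decision point and error type that an agent could turn into a rule such as "avoid Place 5". How the retrieved entry is converted into a prompt or rule is not shown here.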
Evaluation Highlights
  • +52.9% improvement in Success Rate (SR) over NavGPT on the R2R validation unseen split
  • +50% improvement in Success weighted by Path Length (SPL) over MapGPT on the R2R validation unseen split
  • +200% improvement in Success Rate (SR) over NavGPT in real-world TurtleBot 4 Lite tests
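For readers unfamiliar with the two metrics named above, the sketch below computes SR and SPL using the standard definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the longer of the shortest path and the path actually taken. The episode dictionaries and their keys (`success`, `shortest`, `taken`) are hypothetical, and the numbers are toy values, not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            # A path longer than the shortest one is penalized proportionally.
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Toy data: lengths in meters, purely illustrative.
episodes = [
    {"success": True,  "shortest": 10.0, "taken": 10.0},  # optimal route
    {"success": True,  "shortest": 10.0, "taken": 20.0},  # success with a detour
    {"success": False, "shortest": 8.0,  "taken": 5.0},
    {"success": False, "shortest": 12.0, "taken": 30.0},
]

print(success_rate(episodes))  # 0.5
print(spl(episodes))           # 0.375
```

The second episode shows why SPL can trail SR: the agent succeeds, but a route twice the shortest length earns only half credit.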
Breakthrough Assessment
7/10
Significant performance jumps over LLM-based baselines and effective transfer to real robots. The distinct handling of success (full path) vs. failure (key error) memory is a clever, human-inspired design choice.