RAG: Retrieval-Augmented Generation—AI systems that answer questions or make decisions by first retrieving relevant data from an external store and then conditioning the model's output on it
CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images and text in a shared embedding space
SPL: Success weighted by Path Length—a metric balancing navigation success with trajectory efficiency
R2R: Room-to-Room—a standard dataset for vision-and-language navigation tasks in indoor environments
SR: Success Rate—the percentage of navigation episodes where the agent stops within 3 meters of the goal
Detic: Detector with Image Classes—an object detection model used here to extract landmark text from images
FAISS: Facebook AI Similarity Search—a library for efficient similarity search and clustering of dense vectors
NE: Navigation Error—average distance in meters from the agent's final position to the goal
OSR: Oracle Success Rate—success rate if the agent had stopped at the closest point to the goal during its path
Chain-of-Thought: A prompting technique that encourages the large language model (LLM) to generate intermediate reasoning steps before giving its final answer
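
The four navigation metrics defined above (NE, SR, OSR, SPL) can be computed together from per-episode distances. The following is a minimal sketch, not the evaluation code of any particular benchmark. The `Episode` fields and the `navigation_metrics` function are hypothetical names introduced for illustration, the strict "less than 3 meters" comparison is an assumption about how "within 3 meters" is applied, and SPL follows the standard definition: the mean over episodes of success times shortest-path length divided by the larger of the actual and shortest path lengths.

```python
from dataclasses import dataclass
from typing import Dict, List

# Success threshold in meters, taken from the SR entry above.
SUCCESS_RADIUS_M = 3.0


@dataclass
class Episode:
    """Hypothetical per-episode record; all values are in meters."""
    final_dist: float       # distance from the agent's stop position to the goal
    min_dist: float         # closest distance to the goal at any point on the path
    path_length: float      # length of the path the agent actually took
    shortest_length: float  # shortest-path length from start to goal


def navigation_metrics(episodes: List[Episode]) -> Dict[str, float]:
    """Return NE (meters) and SR, OSR, SPL as fractions in [0, 1]."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")

    ne_sum = 0.0
    successes = 0
    oracle_successes = 0
    spl_sum = 0.0

    for ep in episodes:
        ne_sum += ep.final_dist

        # Assumption: "within 3 meters" is a strict comparison.
        success = ep.final_dist < SUCCESS_RADIUS_M
        successes += success
        oracle_successes += ep.min_dist < SUCCESS_RADIUS_M

        # SPL term: success weighted by (shortest / max(actual, shortest)).
        denom = max(ep.path_length, ep.shortest_length)
        if success and denom > 0:
            spl_sum += ep.shortest_length / denom

    return {
        "NE": ne_sum / n,
        "SR": successes / n,
        "OSR": oracle_successes / n,
        "SPL": spl_sum / n,
    }


if __name__ == "__main__":
    demo = [
        Episode(final_dist=1.0, min_dist=0.5, path_length=10.0, shortest_length=10.0),
        Episode(final_dist=5.0, min_dist=2.0, path_length=12.0, shortest_length=8.0),
        Episode(final_dist=2.0, min_dist=2.0, path_length=20.0, shortest_length=10.0),
    ]
    for name, value in navigation_metrics(demo).items():
        print(f"{name}: {value:.3f}")
```

In the demo data, the second episode passes within 3 meters of the goal but stops 5 meters away, so it counts toward OSR but not SR. The third episode succeeds but takes a path twice the shortest length, so it contributes only 0.5 to SPL. Multiply SR, OSR, and SPL by 100 to report them as percentages.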