Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

📝 Paper Summary

Vision-and-Language Navigation (VLN) · Embodied AI · Hierarchical Planning

DualVLN decouples navigation into two asynchronous systems: a slow VLM that reasons about global goals and a fast diffusion policy that executes agile, obstacle-aware local control.

Core Problem

Existing Vision-Language-Action (VLA) models use tightly coupled end-to-end pipelines that map inputs directly to discrete actions, causing high latency, jerky motion, and poor dynamic obstacle avoidance.

Why it matters:

High latency in VLA models (often >1s) makes them unsafe for real-world robots interacting with moving people
Entangling high-level reasoning with low-level control prevents models from reacting quickly to immediate collision threats
Discrete action spaces (e.g., 'move forward 0.25m') produce unnatural, fragmented robot movement compared to continuous control

Concrete Example: In a dynamic hallway, a standard VLA might freeze while processing the instruction 'go to the bedroom' and fail to dodge a pedestrian walking into its path because its inference cycle is too slow (e.g., 1.1s) to update the trajectory in time.

Key Novelty

Asynchronous Dual-System Architecture (System 2 Planner + System 1 Controller)

Decouples reasoning from execution: System 2 (VLM) 'grounds slowly' by identifying mid-term pixel goals, while System 1 (Diffusion Policy) 'moves fast' by generating smooth trajectories at high frequency
Connects systems via latent queries: Instead of just passing coordinates, System 2 passes rich latent embeddings to System 1, preserving semantic context for the local controller
Asynchronous execution: The local controller runs at 30Hz using the most recent available plan from the slower global planner, ensuring real-time responsiveness
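A minimal timing sketch of the asynchronous loop described above, not the authors' implementation. It assumes the planner restarts as soon as it finishes, takes 700 ms per call (the reported 0.7s), and that the controller ticks every 33 ms (roughly 30Hz). The names Tick and simulate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    time_ms: int      # time of this control step
    plan_id: int      # index of the System 2 plan in use (-1 means none yet)
    plan_age_ms: int  # time since that plan became available


def simulate(duration_ms=3000, planner_latency_ms=700, control_period_ms=33):
    """Replay which plan each fast control step would see.

    Assumes the planner starts a new inference the moment the previous one
    finishes, so plan k becomes available at (k + 1) * planner_latency_ms.
    """
    ticks = []
    for t in range(0, duration_ms, control_period_ms):
        finished = t // planner_latency_ms  # plans completed by time t
        plan_id = finished - 1
        age = t - finished * planner_latency_ms if finished else 0
        ticks.append(Tick(t, plan_id, age))
    return ticks


ticks = simulate()
steps_on_plan_0 = sum(1 for tick in ticks if tick.plan_id == 0)
print(f"{len(ticks)} control steps, {steps_on_plan_0} of them on plan 0")
```

Under these assumed timings, each plan is reused for about 21 consecutive control steps, which is why the controller needs its own fresh observations to react between planner updates.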

Architecture

The dual-system architecture. Top: System 2 (VLM) processing instructions and history to output pixel/latent goals. Bottom: System 1 (DiT) taking these goals + current RGB to output trajectory.

Evaluation Highlights

System 1 achieves 0.03s inference latency (approx. 30Hz), enabling real-time continuous control, compared to 0.7s+ for the VLM planner
Optimizations reduce System 2's trajectory token inference time from 1.1s to 0.7s using KV-cache reuse
On the new Social-VLN benchmark, DualVLN maintains higher task completion rates than StreamVLN (though both suffer ~26-27% drops compared to static settings)
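The KV-cache reuse mentioned above can be illustrated with a toy single-head attention layer. This is a generic sketch of the technique, not DualVLN's code: which tokens the paper actually caches is not stated in this summary, and the KVCache class, dimensions, and token counts are made up for illustration.

```python
import numpy as np


def attend(q, K, V):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


class KVCache:
    """Stores projected keys/values so old tokens are never re-projected."""

    def __init__(self, Wk, Wv):
        self.Wk, self.Wv = Wk, Wv
        self.K = np.empty((0, Wk.shape[1]))
        self.V = np.empty((0, Wv.shape[1]))
        self.projected_tokens = 0

    def extend(self, tokens):
        self.K = np.vstack([self.K, tokens @ self.Wk])
        self.V = np.vstack([self.V, tokens @ self.Wv])
        self.projected_tokens += len(tokens)


rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
history = rng.normal(size=(50, d))  # tokens shared with the previous call
new = rng.normal(size=(2, d))       # tokens added since then

cache = KVCache(Wk, Wv)
cache.extend(history)  # paid once, during the previous call
cache.extend(new)      # only the 2 new tokens are projected now
out_cached = attend(new[-1] @ Wq, cache.K, cache.V)

# Reference: recompute keys/values for every token from scratch.
tokens = np.vstack([history, new])
out_full = attend(new[-1] @ Wq, tokens @ Wk, tokens @ Wv)
```

Both paths produce the same attention output; the cached path only skips the key/value projections for tokens it has already seen.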

Breakthrough Assessment

8/10

Strong engineering of a dual-system approach that practically addresses the latency/control bottleneck of large VLMs in robotics. The introduction of the Social-VLN benchmark is also a valuable contribution.

⚙️ Technical Details

Problem Definition

Setting: Continuous Vision-and-Language Navigation (VLN) in dynamic environments

Inputs: Natural language instruction, sequence of egocentric RGB-D images

Outputs: Continuous velocity/trajectory commands (linear and angular velocities) for the robot

Pipeline Flow

System 2 (VLM Planner): Instruction + History -> Pixel Goal + Latent Features
System 1 (Policy): Latent Features + High-Freq RGB -> Continuous Trajectory

System Modules

System 2 (Global Planner) (Reasoning & Planning)

Predict mid-term pixel goals and extract latent semantic features for the controller

Model or implementation: Qwen2.5-VL-7B (fine-tuned)
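One plausible way to obtain a pixel goal from a 3D path is a pinhole projection of upcoming waypoints into the current camera image, keeping the farthest one that lands inside the frame. The sketch below assumes a camera frame with x right, y down, z forward, ignores occlusion, and uses hypothetical function names and intrinsics; the paper's actual labeling procedure may differ.

```python
def project_to_pixel(point_cam, fx, fy, cx, cy, width, height):
    """Project a camera-frame 3D point to pixel coordinates.

    Returns (u, v), or None if the point is behind the camera or
    falls outside the image bounds.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if not (0 <= u < width and 0 <= v < height):
        return None
    return u, v


def farthest_visible_goal(waypoints_cam, **intrinsics):
    """Return the pixel of the last waypoint along the path that is in view."""
    goal = None
    for point in waypoints_cam:
        pixel = project_to_pixel(point, **intrinsics)
        if pixel is not None:
            goal = pixel
    return goal


# Hypothetical intrinsics for a 640x480 camera.
intrinsics = dict(fx=320.0, fy=320.0, cx=320.0, cy=240.0, width=640, height=480)
path = [(0.0, 0.5, 1.0), (0.5, 0.5, 2.0), (3.0, 0.5, 2.5)]  # last one is out of view
goal = farthest_visible_goal(path, **intrinsics)
```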

Latent Query Projector (Reasoning & Planning)

Extract compact semantic cues from the VLM for the diffusion policy

Model or implementation: Learnable Latent Queries (Linear Projection)
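A shape-level sketch of the projector, assuming the learnable queries are appended to the VLM input and their final-layer hidden states are read back out. The 3584 and 768 dimensions come from the hyperparameters in this summary; the number of queries (4), the sequence length, and the class name are assumptions for illustration only.

```python
import numpy as np

VLM_HIDDEN = 3584  # VLM hidden size (input side of the projection)
POLICY_DIM = 768   # conditioning width passed to the policy
NUM_QUERIES = 4    # assumption: the summary does not give the query count


class LatentGoalProjector:
    """Linear map from VLM query hidden states to policy conditioning."""

    def __init__(self, rng):
        scale = VLM_HIDDEN ** -0.5
        self.W = rng.normal(scale=scale, size=(VLM_HIDDEN, POLICY_DIM))
        self.b = np.zeros(POLICY_DIM)

    def __call__(self, hidden_states, query_positions):
        # hidden_states: (seq_len, VLM_HIDDEN) final-layer VLM states
        # query_positions: indices where the latent query tokens sit
        query_states = hidden_states[query_positions]
        return query_states @ self.W + self.b


rng = np.random.default_rng(0)
seq_len = 20
hidden = rng.normal(size=(seq_len, VLM_HIDDEN))      # stand-in for VLM output
positions = np.arange(seq_len - NUM_QUERIES, seq_len)  # queries at the end
latent_goal = LatentGoalProjector(rng)(hidden, positions)
```

The result has one 768-dimensional vector per query, which the fast policy can consume as conditioning tokens without ever running the VLM itself.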

System 1 (Local Policy)

Generate smooth, obstacle-aware trajectories based on global plans and immediate vision

Model or implementation: Diffusion Transformer (DiT)
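At inference time, a flow-matching policy typically produces a trajectory by integrating its learned velocity field from noise to data. The sketch below uses plain Euler steps and replaces the DiT with a toy velocity function that already knows the target, so it only illustrates the sampling loop. The step count, the 2D waypoint layout, and all function names are assumptions.

```python
import numpy as np


def sample_trajectory(velocity_fn, cond, num_waypoints=32, steps=10, rng=None):
    """Integrate dx/du = v(x, u, cond) from u=0 (noise) to u=1 with Euler steps."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(size=(num_waypoints, 2))  # start from Gaussian noise
    du = 1.0 / steps
    for k in range(steps):
        u = k * du
        x = x + du * velocity_fn(x, u, cond)
    return x


def toy_velocity(x, u, cond):
    """Stand-in for the DiT: exact velocity of a straight path toward cond."""
    return (cond - x) / (1.0 - u)


# Hypothetical target: 32 waypoints on a straight 2 m line.
target = np.stack([np.linspace(0.0, 2.0, 32), np.zeros(32)], axis=1)
traj = sample_trajectory(toy_velocity, target, rng=np.random.default_rng(0))
```

Because the toy field is exact for straight-line paths, the Euler loop lands on the target; a trained network would only approximate this, and the number of integration steps trades latency against accuracy.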

Novel Architectural Elements

Asynchronous dual-loop topology where the slow loop (System 2) updates latent conditions for the fast loop (System 1)
Latent Goal Conditioning: System 2 passes hidden states (via learnable queries) rather than just discrete text/coordinates to System 1

Modeling

Base Model: Qwen2.5-VL-7B (System 2)

Training Method: Decoupled Sequential Training

Objective Functions:

Purpose: Train System 2 to identify navigation targets.

Formally: Farthest pixel goal grounding loss (predicting the 2D image coordinates of the next waypoint).

Purpose: Train System 1 to generate trajectories.

Formally: Flow Matching objective minimizing the MSE between the predicted velocity $v_t$ and the true velocity $\dot{X}_u$: $L(\theta) = \mathbb{E}\left[\lVert v_t - \dot{X}_u \rVert^2\right]$.
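The flow matching objective above can be estimated with a few lines of NumPy. This sketch assumes the common linear interpolation path between Gaussian noise X_0 and a ground-truth trajectory X_1, for which the true velocity is X_1 - X_0; the summary does not say which path the paper uses, and the function names are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Monte Carlo estimate of E||v(X_u, u, cond) - dX_u/du||^2.

    x1: batch of ground-truth trajectories, shape (batch, waypoints, 2).
    """
    x0 = rng.normal(size=x1.shape)               # noise endpoint
    u = rng.uniform(size=(x1.shape[0], 1, 1))    # one time per sample
    x_u = (1.0 - u) * x0 + u * x1                # point on the linear path
    target_velocity = x1 - x0                    # derivative of x_u w.r.t. u
    pred = velocity_fn(x_u, u, cond)
    return np.mean(np.sum((pred - target_velocity) ** 2, axis=-1))


rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 32, 2))  # 8 toy trajectories of 32 waypoints


def zero_model(x_u, u, cond):
    """Untrained stand-in that always predicts zero velocity."""
    return np.zeros_like(x_u)


def oracle_model(x_u, u, cond):
    """Cheating stand-in that knows x1, so its loss should be ~0."""
    return (x1 - x_u) / (1.0 - u)


loss_zero = flow_matching_loss(zero_model, x1, None, rng)
loss_oracle = flow_matching_loss(oracle_model, x1, None, rng)
```

The oracle recovers the target velocity exactly, so its loss is numerically zero, while the zero predictor pays the full squared norm of X_1 - X_0.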

Trainable Parameters: System 2: Fully unfrozen VLM. System 1: DiT and Latent Queries.

Training Data:

System 2: StreamVLN data recipe (VLN-CE trajectories projected to pixel goals).
Social-VLN: 763K social navigation episodes generated with A* replanning around dynamic humans.
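To make "A* replanning around dynamic humans" concrete, here is a minimal grid-world sketch: at every step the agent replans with cells currently occupied by humans treated as blocked, then takes one step. It is an illustration under simplifying assumptions (4-connected grid, unit costs, humans known per step), not the Social-VLN generation pipeline; astar and navigate are hypothetical names.

```python
import heapq


def astar(grid, start, goal, blocked=frozenset()):
    """4-connected A* on a 0/1 occupancy grid; returns a cell list or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance, admissible for unit step costs
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]
    came_from, cost = {start: None}, {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if g > cost[cur]:
            continue  # stale queue entry
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid[nr][nc] or nxt in blocked:
                continue
            if g + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = g + 1
                came_from[nxt] = cur
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None


def navigate(grid, start, goal, humans_per_step):
    """Replan every step around the humans' current cells, then move once."""
    pos, trace = start, [start]
    for humans in humans_per_step:
        if pos == goal:
            break
        path = astar(grid, pos, goal, blocked=frozenset(humans))
        if path is not None and len(path) > 1:
            pos = path[1]
        trace.append(pos)  # stays in place when fully blocked
    return trace


grid = [[0] * 5 for _ in range(3)]
static_path = astar(grid, (1, 0), (1, 4))
trace = navigate(grid, (1, 0), (1, 4), [[(1, 2)]] * 10)  # human stands mid-corridor
```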

Key Hyperparameters:

DiT_hidden_dim: 384
DiT_layers: 12
DiT_heads: 6
Latent_projection: 3584 to 768
Trajectory_waypoints: 32
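For readers wiring these values into code, one way to hold them is a small frozen dataclass. Only the values listed in this summary are included; the field names and the System1Config class are hypothetical, and the derived per-head width is simple arithmetic (384 / 6).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System1Config:
    dit_hidden_dim: int = 384
    dit_layers: int = 12
    dit_heads: int = 6
    latent_in_dim: int = 3584   # width of the VLM hidden states
    latent_out_dim: int = 768   # width after the latent projection
    trajectory_waypoints: int = 32

    def __post_init__(self):
        # Multi-head attention needs the hidden size to split evenly.
        if self.dit_hidden_dim % self.dit_heads:
            raise ValueError("dit_hidden_dim must be divisible by dit_heads")

    @property
    def head_dim(self) -> int:
        return self.dit_hidden_dim // self.dit_heads


config = System1Config()
```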

Compute: Inference takes 0.7s for System 2 and 0.03s for System 1. The full model runs on a single RTX 4090 (20GB memory).

Comparison to Prior Work

vs. StreamVLN: DualVLN uses continuous control via diffusion (System 1) rather than discrete actions, enabling better obstacle avoidance.
vs. NaVILA: DualVLN decouples global planning and local control, allowing asynchronous high-frequency execution (30Hz) vs NaVILA's slower E2E inference.
vs. RT-2 [not cited in paper]: RT-2 uses a single synchronous VLA for both reasoning and control, whereas DualVLN splits these into slow/fast specialized modules.

Limitations

Performance drops significantly (~27%) in dynamic social environments compared to static ones.
Requires complex asynchronous infrastructure (two concurrent models) compared to simple end-to-end pipelines.
System 2 inference is still relatively slow (0.7s), potentially limiting responsiveness to high-level instruction changes.

Reproducibility

Project name 'InternNav'. Code URL not provided in text (likely masked/placeholder in extraction). Uses public models (Qwen-VL, DepthAnythingV2) and standard benchmarks (VLN-CE, Habitat). Social-VLN data generation pipeline described.

📊 Experiments & Results

Evaluation Setup

Instruction following in photo-realistic 3D environments (Habitat simulator)

Benchmarks:

R2R-CE (Continuous Vision-and-Language Navigation)
VLN-PE (Physically realistic VLN with locomotion control)
Social-VLN (VLN with dynamic humanoid obstacles) [New]

Metrics:

Success Rate (SR)
Navigation Error (NE)
Success weighted by Path Length (SPL)
Human Collision Rate (HCR)
Trajectory Length (TL)

Statistical methodology: Not explicitly reported in the paper
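The path-based metrics listed above have standard definitions in the VLN literature, sketched here for reference. The 3 m success radius is the customary value for R2R-style benchmarks but is an assumption for this summary, and HCR is left out because its exact definition is not given here. Episode and summarize are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    final_dist_m: float    # distance to the goal when the agent stops
    path_len_m: float      # length of the path the agent actually took
    shortest_len_m: float  # shortest-path length from start to goal


def summarize(episodes, success_radius_m=3.0):
    """Average SR, NE, SPL and TL over a list of episodes."""
    n = len(episodes)
    sr = ne = spl = tl = 0.0
    for ep in episodes:
        success = ep.final_dist_m <= success_radius_m
        sr += success
        ne += ep.final_dist_m
        tl += ep.path_len_m
        if success:
            # SPL credits a success by how close the path is to the shortest one.
            spl += ep.shortest_len_m / max(ep.path_len_m, ep.shortest_len_m)
    return {"SR": sr / n, "NE": ne / n, "SPL": spl / n, "TL": tl / n}


metrics = summarize([
    Episode(final_dist_m=1.0, path_len_m=12.0, shortest_len_m=10.0),  # success
    Episode(final_dist_m=5.0, path_len_m=8.0, shortest_len_m=10.0),   # failure
])
```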

Key Results

Latency analysis demonstrates the speed advantage of the dual-system design, enabling real-time control.

Benchmark	Metric	Baseline	This Paper	Δ
Inference Speed	Inference Time, System 1 (s)	Not reported in the paper	0.03	-
Inference Speed	Inference Time, System 2 (s)	1.1 (before KV-cache reuse)	0.7	-0.4

Experiment Figures

Ablation study on goal representations (w/o Sys.2 Train, w/o Pixel Goal, w/o Latent Goal).

Main Takeaways

Decoupling the VLM (System 2) from the controller (System 1) allows high-frequency control (30Hz) without sacrificing semantic reasoning.
DualVLN outperforms prior RGB-based methods on the R2R-CE and VLN-PE benchmarks (qualitative claim; exact numbers are not included in this summary).
Dynamic environments (Social-VLN) remain a major challenge, causing ~27% performance drops even for dual-system models, though DualVLN handles them better than single-stream baselines.
Explicit pixel goals combined with implicit latent goals perform better than either alone, as latent features provide context that simple coordinates miss.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) and their latency constraints
Diffusion Policies for robot control
Hierarchical Reinforcement Learning / Planning concepts

Key Terms

VLA: Vision-Language-Action models—systems that take vision and language inputs and directly output robot actions

Diffusion Policy: A robot control policy that generates actions by denoising random noise, conditioned on observations

Flow Matching: A generative modeling technique that trains a network to predict the velocity field carrying noise samples to data samples; used here to train the diffusion policy to predict trajectory velocities

KV-cache: Key-Value cache—a memory optimization technique in Transformers to speed up inference by reusing previously computed attention representations

Pixel Goal Grounding: Identifying a specific 2D point in an image that corresponds to a navigational target (e.g., 'the door')

Social-VLN: A new benchmark proposed in this paper that introduces dynamic humanoid agents into VLN environments to test obstacle avoidance

DiT: Diffusion Transformer—a neural network architecture that uses Transformer blocks within a diffusion generation process

Q-Former: A module (from BLIP-2) that bridges a frozen image encoder and a language model by extracting a fixed number of visual tokens