Don’t Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving

📝 Paper Summary

End-to-End Autonomous Driving Trajectory Planning

MomAD stabilizes autonomous driving trajectories by explicitly integrating momentum—aligning current plans with historical paths and aggregating past perception context to prevent erratic control shifts.

Core Problem

Current end-to-end planners rely on one-shot predictions from single-frame perception, leading to temporal inconsistency (jittery control), vulnerability to occlusions, and lack of long-horizon stability.

Why it matters:

Inconsistent predictions cause 'shaking' (sudden directional shifts), leading to uncomfortable and unsafe driving experiences.
One-shot multi-modal predictions are susceptible to noise and temporary occlusions, potentially causing the planner to switch abruptly between conflicting trajectories.
Without temporal coherence, vehicles may fail to maintain steady progress during complex maneuvers like turns.

Concrete Example: In a turning scenario, a standard planner might predict a smooth left turn at frame t, but at frame t+1—due to a momentary detection error—it might suddenly predict a straight path. This discontinuity forces the vehicle to jerk or 'shake' the wheel, increasing collision risk.

Key Novelty

Momentum-Aware Driving (MomAD)

Introduces 'Trajectory Momentum': Uses Hausdorff Distance to select the candidate trajectory that best preserves the shape and topology of the previous time step's path, ensuring smooth motion.
Introduces 'Perception Momentum': A module that cross-attends current planning queries with historical ones, allowing the model to 'remember' past context (like occluded agents) and refine predictions.

Architecture

Overview of the MomAD framework, detailing the Sparse Perception module, the Trajectory Prediction loop, and the two core momentum components: TTM and MPI.

Evaluation Highlights

Reduces collision rate by 26% compared to SparseDrive on the curated Turning-nuScenes validation set (6-second prediction horizon).
Improves Trajectory Prediction Consistency (TPC) by 33.45% (0.97 m) over SparseDrive on Turning-nuScenes, demonstrating significantly more stable planning.
Achieves up to 16.3% improvement in success rate on the closed-loop Bench2Drive benchmark.

Breakthrough Assessment

8/10

Addresses the critical and often overlooked problem of temporal consistency in end-to-end driving. The explicit modeling of 'momentum' offers a physics-inspired solution that boosts stability and safety on the reported benchmarks.

⚙️ Technical Details

Problem Definition

Setting: End-to-end trajectory planning where the model predicts future waypoints given multi-view sensor inputs, while ensuring temporal consistency with history.

Inputs: Multi-view camera images

Outputs: Planned trajectory set (waypoints) and associated scores

Pipeline Flow

Sparse Perception → Instance Feature Extraction
Robust Instance Denoising (Transformer Block)
Candidate Trajectory Generation
Topological Trajectory Matching (Selection)
Momentum Planning Interactor (Query Mixing)
Planning Head (Final Prediction)

System Modules

Sparse Perception (Perception)

Extracts instance features for road agents and map elements from multi-view images

Model or implementation: SparseDrive-based Encoder

Robust Instance Denoising (Perception)

Filters noise from instance features to improve robustness against detection errors

Model or implementation: Lightweight Encoder-Decoder Transformer

Topological Trajectory Matching (TTM)

Selects the best candidate trajectory that aligns topologically with the historical path

Model or implementation: Hausdorff Distance Calculator
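
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every candidate and the previous plan are short lists of 2D waypoints already expressed in the same coordinate frame, and the names `hausdorff` and `select_by_momentum` are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two waypoint lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def select_by_momentum(candidates, previous):
    """Return the index of the candidate closest in shape to the previous plan.

    Hypothetical helper: assumes all trajectories share one coordinate frame.
    """
    distances = [hausdorff(c, previous) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances


# Toy data (made up for illustration): the previous plan was a left turn.
previous_plan = [(0.0, 1.0), (-0.5, 2.0), (-1.5, 3.0)]
candidates = [
    [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)],    # go straight
    [(0.0, 1.0), (-0.4, 2.0), (-1.4, 2.9)],  # keep turning left
    [(0.0, 1.0), (0.5, 2.0), (1.5, 3.0)],    # turn right
]

best_index, distances = select_by_momentum(candidates, previous_plan)
print(best_index, [round(d, 3) for d in distances])
```

In this toy case the left-turn candidate is selected even if a score-based ranking would have preferred the straight path, which is the behavior the module is meant to encourage. The quadratic point-to-point comparison is acceptable here because planned trajectories contain only a handful of waypoints.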

Momentum Planning Interactor (MPI)

Enriches the selected planning query with historical context via cross-attention

Model or implementation: Long-horizon Query Mixer (LSTM + Cross-Attention)
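
To make the query-mixing idea concrete, here is a deliberately simplified sketch of one planning query attending over stored historical queries. It is an assumption-laden illustration, not the paper's module: the LSTM stage and all learned projection matrices are omitted (identity projections), attention is single-head, and the names `cross_attend` and `mix_with_history` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, history):
    """Scaled dot-product attention of one query over historical queries.

    Simplification: keys and values are the raw historical vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, h)) / scale for h in history]
    weights = softmax(scores)
    context = [
        sum(w * h[i] for w, h in zip(weights, history))
        for i in range(len(query))
    ]
    return context, weights


def mix_with_history(query, history):
    """Add the attended historical context to the query (residual connection)."""
    context, weights = cross_attend(query, history)
    mixed = [q + c for q, c in zip(query, context)]
    return mixed, weights


# Toy vectors (made up): one historical query resembles the current one.
current_query = [1.0, 0.0, 0.0, 0.0]
history_queries = [
    [0.9, 0.1, 0.0, 0.0],  # similar past context
    [0.0, 0.0, 1.0, 0.0],  # unrelated past context
]

mixed_query, attention = mix_with_history(current_query, history_queries)
print([round(w, 3) for w in attention])
```

The historical query that resembles the current one receives the larger attention weight, so its content dominates the context added back to the planning query. A real implementation would use learned projections and batched tensors in a deep learning framework.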

Planning Head

Generates the final refined trajectory based on the momentum-enriched query

Model or implementation: Transformer Decoder / Regression Head

Novel Architectural Elements

Topological Trajectory Matching (TTM): A selection module inserted into the planning loop that uses geometric topology (Hausdorff distance) rather than just probability scores to pick the base trajectory.
Momentum Planning Interactor (MPI): A recursive query-mixing architecture that cross-attends current queries with LSTM-processed historical queries to inject 'perception momentum'.

Modeling

Base Model: SparseDrive

Training Method: End-to-end training with denoising perturbations

Training Data:

nuScenes dataset
Turning-nuScenes (curated subset)

Compute: Not reported in the paper

Comparison to Prior Work

vs. SparseDrive: MomAD adds explicit temporal consistency (momentum) via TTM and MPI modules, whereas SparseDrive is one-shot [Baseline]
vs. UniAD/VAD: MomAD uses multi-modal probabilistic planning with momentum refinement, unlike the deterministic approaches in early UniAD/VAD

Limitations

Relies on the quality of upstream sparse perception; if detection fails completely, momentum can only compensate partially.
Hausdorff distance calculation adds computational overhead compared to simple Euclidean matching.
Most evaluations focus on nuScenes, which has many straight roads; the benefits are primarily visible in turning scenarios (hence the curated set).

Reproducibility

Code: https://github.com/adept-thu/MomAD

The paper describes the architecture and logic clearly.

📊 Experiments & Results

Evaluation Setup

Open-loop evaluation on nuScenes and Turning-nuScenes; Closed-loop evaluation on Bench2Drive.

Benchmarks:

nuScenes (Open-loop trajectory prediction)
Turning-nuScenes (Open-loop trajectory prediction, turning scenarios) [New]
Bench2Drive (Closed-loop driving simulation)

Metrics:

Collision Rate
Trajectory Prediction Consistency (TPC)
Success Rate
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Comparison of different planning paradigms: (a) Deterministic, (b) Probabilistic/Multi-modal, and (c) MomAD's Momentum-based approach.

Main Takeaways

MomAD outperforms its SparseDrive baseline in complex scenarios, reducing the collision rate by 26% on the Turning-nuScenes validation set.
The introduction of 'momentum' (via TTM and MPI) effectively stabilizes predictions, improving temporal consistency (TPC) by over 33%.
Closed-loop performance on Bench2Drive indicates that open-loop stability gains translate to better driving success rates (up to +16.3%).
The method is particularly effective in long-horizon consistency (>= 3 seconds), addressing a key weakness of one-shot planners.

📚 Prerequisite Knowledge

Prerequisites

End-to-End Autonomous Driving architectures (e.g., VAD, SparseDrive)
Transformer Decoder mechanisms (Query-Key-Value attention)
Hausdorff Distance (metric for measuring similarity between two point sets)

Key Terms

TPC: Trajectory Prediction Consistency—a new metric proposed in this paper to quantitatively measure the stability/alignment between predicted and historical trajectories.
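
This summary does not reproduce the paper's exact TPC formula, so the sketch below is only a stand-in for the general idea of a consistency score measured in meters. It assumes consecutive plans share a coordinate frame and a fixed waypoint spacing, so that waypoint i+1 of the previous plan and waypoint i of the current plan refer to the same instant. The function name `consistency_error` is hypothetical.

```python
import math


def consistency_error(previous_plan, current_plan, shift=1):
    """Mean distance between time-aligned waypoints of two consecutive plans.

    Illustrative stand-in, not the paper's TPC definition. Lower is steadier.
    """
    overlap_prev = previous_plan[shift:]
    pairs = list(zip(overlap_prev, current_plan))
    if not pairs:
        raise ValueError("plans do not overlap in time")
    return sum(math.dist(p, c) for p, c in pairs) / len(pairs)


# Toy plans (made up): the vehicle was planning to drive straight ahead.
previous = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0)]
steady = [(0.0, 2.0), (0.0, 3.0), (0.0, 4.0), (0.0, 5.0)]  # same intent
shaky = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]   # sudden swerve

steady_error = consistency_error(previous, steady)
shaky_error = consistency_error(previous, shaky)
print(steady_error, shaky_error)
```

A plan that continues the previous intent scores zero, while the swerving plan accumulates error on every overlapping waypoint.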

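Hausdorff Distance: A metric between two point sets, defined as the larger of the two directed distances, where each directed distance is the maximum, over the points of one set, of the distance to the nearest point in the other set; used here to check that the shape of the new trajectory matches the old one.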
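
For reference, the standard symmetric form of the distance can be written as follows, where A and B stand for the waypoint sets of two trajectories (these symbols are introduced here for illustration and are not the paper's notation):

```latex
d_H(A, B) = \max\left\{
  \sup_{a \in A} \inf_{b \in B} \lVert a - b \rVert,\;
  \sup_{b \in B} \inf_{a \in A} \lVert a - b \rVert
\right\}
```

For finite waypoint sets the suprema and infima reduce to plain maxima and minima.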

Turning-nuScenes: A curated validation set derived from the nuScenes dataset, specifically focusing on turning scenarios to rigorously test temporal consistency.

Bench2Drive: A closed-loop autonomous driving benchmark that evaluates whether the agent can successfully complete routes in a simulator.

SparseDrive: A state-of-the-art sparse perception and planning framework that serves as the base model for this paper.