Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, Yadan Luo
School of Computer Science and Technology, Beijing Jiaotong University
Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence
University of Macau
The University of Queensland
Computer Vision and Pattern Recognition (2025)
MMAgentBenchmark
📝 Paper Summary
End-to-End Autonomous Driving, Trajectory Planning
MomAD stabilizes autonomous driving trajectories by explicitly modeling momentum: it aligns the current plan with historical paths and aggregates past perception context to prevent erratic control shifts.
Core Problem
Current end-to-end planners rely on one-shot predictions from single-frame perception, leading to temporal inconsistency (jittery control), vulnerability to occlusions, and lack of long-horizon stability.
Why it matters:
Inconsistent predictions cause 'shaking' (sudden directional shifts), leading to uncomfortable and unsafe driving experiences.
One-shot multi-modal predictions are susceptible to noise and temporary occlusions, potentially causing the planner to switch abruptly between conflicting trajectories.
Without temporal coherence, vehicles may fail to maintain steady progress during complex maneuvers like turns.
Concrete Example: In a turning scenario, a standard planner might predict a smooth left turn at frame t, but at frame t+1 a momentary detection error might make it suddenly predict a straight path. This discontinuity forces the vehicle to jerk or 'shake' the wheel, increasing collision risk.
Key Novelty
Momentum-Aware Driving (MomAD)
Introduces 'Trajectory Momentum': Uses Hausdorff Distance to select the candidate trajectory that best preserves the shape and topology of the previous time step's path, ensuring smooth motion.
Introduces 'Perception Momentum': A module that cross-attends current planning queries with historical ones, allowing the model to 'remember' past context (like occluded agents) and refine predictions.
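The 'Trajectory Momentum' idea above can be illustrated with a minimal sketch: among several candidate trajectories, pick the one with the smallest Hausdorff distance to the previous frame's plan. This is an illustration under stated assumptions, not the authors' implementation. The function names, the 2D waypoint format, and the assumption that the previous plan is already expressed in the current ego frame are all hypothetical, and the paper's selection may also involve confidence scores.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two waypoint lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def select_by_trajectory_momentum(candidates, previous):
    """Return the index of the candidate closest in shape to `previous`.

    candidates: list of trajectories, each a list of (x, y) waypoints.
    previous:   the plan chosen at the last frame (assumed to be in the
                same coordinate frame as the candidates).
    """
    distances = [hausdorff(c, previous) for c in candidates]
    return distances.index(min(distances))

# Toy data (hypothetical): the previous plan is a gentle left turn.
previous = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.8), (3.0, 1.8)]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
left = [(0.0, 0.0), (1.0, 0.25), (2.0, 0.9), (3.0, 1.9)]
candidates = [straight, left]

best = select_by_trajectory_momentum(candidates, previous)
print("selected candidate index:", best)
```

In this toy case the left-turn candidate is selected, because the straight candidate ends far from the previous plan's final waypoint. A score-only selector could instead flip to the straight path after a single noisy frame, which is the failure described in the concrete example.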
Architecture
Overview of the MomAD framework, detailing the Sparse Perception module, the Trajectory Prediction loop, and the two core momentum components: TTM and MPI.
Evaluation Highlights
Reduces collision rate by 26% compared to SparseDrive on the curated Turning-nuScenes validation set (6-second prediction horizon).
Improves Trajectory Prediction Consistency (TPC) by 33.45% (0.97m) over SparseDrive on Turning-nuScenes, demonstrating significantly more stable planning.
Achieves up to 16.3% improvement in success rate on the closed-loop Bench2Drive benchmark.
Breakthrough Assessment
8/10
Addresses the critical and often overlooked problem of temporal consistency in end-to-end driving. The explicit modeling of 'momentum' offers a physics-grounded solution that significantly boosts stability and safety.
⚙️ Technical Details
Problem Definition
Setting: End-to-end trajectory planning where the model predicts future waypoints given multi-view sensor inputs, while ensuring temporal consistency with history.
Inputs: Multi-view camera images
Outputs: Planned trajectory set (waypoints) and associated scores
Pipeline Flow
Sparse Perception → Instance Feature Extraction
Robust Instance Denoising (Transformer Block)
Candidate Trajectory Generation
Topological Trajectory Matching (Selection)
Momentum Planning Interactor (Query Mixing)
Planning Head (Final Prediction)
System Modules
Sparse Perception (Perception)
Extracts instance features for road agents and map elements from multi-view images
Model or implementation: SparseDrive-based Encoder
Robust Instance Denoising (Perception)
Filters noise from instance features to improve robustness against detection errors
Model or implementation: Lightweight Encoder-Decoder Transformer
Topological Trajectory Matching (TTM)
Selects the best candidate trajectory that aligns topologically with the historical path
Model or implementation: Hausdorff Distance Calculator
Momentum Planning Interactor (MPI)
Enriches the selected planning query with historical context via cross-attention
Model or implementation: Long-horizon Query Mixer (LSTM + Cross-Attention)
Planning Head
Generates the final refined trajectory based on the momentum-enriched query
Model or implementation: Transformer Decoder / Regression Head
Novel Architectural Elements
Topological Trajectory Matching (TTM): A selection module inserted into the planning loop that uses geometric topology (Hausdorff distance) rather than just probability scores to pick the base trajectory.
Momentum Planning Interactor (MPI): A recursive query-mixing architecture that cross-attends current queries with LSTM-processed historical queries to inject 'perception momentum'.
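The query-mixing step of the MPI can be sketched as a cross-attention of the current planning query over a buffer of historical queries. The sketch below is deliberately simplified and should be read as an assumption-laden illustration: it uses single-head scaled dot-product attention with no learned projections, and it attends over the raw historical queries rather than LSTM-processed ones as the summary describes. The name `cross_attend` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(current_query, history_queries):
    """Mix one current query with H historical queries.

    current_query:   list of D floats (the selected planning query).
    history_queries: list of H lists, each of D floats (past queries).
    Returns the residual-updated query and the attention weights.
    """
    dim = len(current_query)
    scale = math.sqrt(dim)
    # Similarity between the current query and each historical query.
    scores = [
        sum(q * k for q, k in zip(current_query, hist)) / scale
        for hist in history_queries
    ]
    weights = softmax(scores)
    # Weighted sum of historical queries (they act as keys and values).
    context = [
        sum(w * hist[i] for w, hist in zip(weights, history_queries))
        for i in range(dim)
    ]
    # Residual connection: keep the current query, add historical context.
    mixed = [q + c for q, c in zip(current_query, context)]
    return mixed, weights

# Toy example: the first historical query matches the current one.
mixed, weights = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print("attention weights:", weights)
print("mixed query:", mixed)
```

The historical query that resembles the current one receives the larger weight, so context from consistent past frames dominates the update. A learned version would add query, key, and value projections and a recurrent encoder over the history.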
Modeling
Base Model: SparseDrive
Training Method: End-to-end training with denoising perturbations
Training Data:
nuScenes dataset
Turning-nuScenes (curated subset)
Compute: Not reported in the paper
Comparison to Prior Work
vs. SparseDrive: MomAD adds explicit temporal consistency (momentum) via TTM and MPI modules, whereas SparseDrive is one-shot [Baseline]
vs. UniAD/VAD: MomAD uses multi-modal probabilistic planning with momentum refinement, unlike the deterministic approaches in early UniAD/VAD
Limitations
Relies on the quality of upstream sparse perception; if detection fails completely, momentum can only compensate partially.
Key Terms
TPC: Trajectory Prediction Consistency, a new metric proposed in this paper that quantifies the stability/alignment between predicted and historical trajectories.
Hausdorff Distance: A metric between two point sets: the largest distance from any point in either set to its nearest point in the other set. MomAD uses it to pick the new trajectory whose shape best matches the previous one.
Turning-nuScenes: A curated validation set derived from the nuScenes dataset, specifically focusing on turning scenarios to rigorously test temporal consistency.
Bench2Drive: A closed-loop autonomous driving benchmark that evaluates whether the agent can successfully complete routes in a simulator.
SparseDrive: A state-of-the-art sparse perception and planning framework that serves as the base model for this paper.
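For reference, the Hausdorff Distance entry above corresponds to the standard definition below for two finite waypoint sets. The symbols P, Q, and d_H are notation chosen here for illustration and are not taken from the paper.

```latex
d_H(P, Q) = \max\left\{
  \max_{p \in P} \min_{q \in Q} \lVert p - q \rVert,\;
  \max_{q \in Q} \min_{p \in P} \lVert p - q \rVert
\right\}
```

The first term is the directed distance from P to Q and the second is the reverse. Taking the larger of the two makes the measure symmetric.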