Generalized Trajectory Scoring for End-to-end Multimodal Planning

📝 Paper Summary

End-to-end autonomous driving, Trajectory planning, Multi-modal planning

GTRS improves autonomous driving planning by training a scorer on a massive, diverse set of static trajectories and then applying it to fine-grained dynamic proposals during inference.

Core Problem

Existing trajectory scorers struggle to generalize: fixed vocabularies lack fine-grained precision for specific scenes, while dynamic proposals are too narrow to capture broad driving distributions during training.

Why it matters:

Fixed vocabularies cannot adapt to complex, safety-critical situations requiring precise maneuvers.
Scorers trained only on small sets of dynamic proposals fail to generalize to unseen trajectory types or environments.
Robust planning requires handling both the breadth of general driving scenarios and the depth of specific, fine-grained interactions.

Concrete Example: A fixed vocabulary planner might fail a complex lane change because no pre-defined trajectory fits the gap perfectly. Conversely, a dynamic planner trained on limited data might generate a valid path but score it incorrectly due to distribution shift in a new city.

Key Novelty

Generalized Trajectory Scoring (GTRS)

Trains a scorer on a 'super-dense' vocabulary (16,384 trajectories) with dropout, pushing it to learn robust, generalizable features rather than overfit to specific patterns.
Combines this robust scorer with a diffusion-based generator at inference time, merging the stability of static priors with the precision of dynamic proposals.
Uses sensor augmentation (rotations) and refinement training (distilling teacher scores) to handle out-of-domain perceptual shifts.
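
The vocabulary dropout idea in the first point can be sketched in a few lines. This is a minimal illustration, assuming that dropout means scoring only a random subset of the vocabulary at each training step; the function name `drop_vocabulary` and the keep ratio of 0.5 are hypothetical and not taken from the paper.

```python
import random

def drop_vocabulary(vocab, keep_ratio, rng):
    """Return a random subset of the trajectory vocabulary for one training step.

    Assumption: 'dropout' keeps a random subset of trajectories per step;
    the paper's exact scheme may differ.
    """
    n_keep = max(1, int(len(vocab) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(vocab)), n_keep))
    return kept_indices, [vocab[i] for i in kept_indices]

# Toy vocabulary: 16,384 placeholder "trajectories" (integer ids stand in for waypoints).
vocab = list(range(16_384))
rng = random.Random(0)
kept_indices, kept = drop_vocabulary(vocab, keep_ratio=0.5, rng=rng)
print(len(kept))  # 8192
```

Because a different subset is drawn at every step, the scorer cannot rely on any fixed set of candidates being present, which is one plausible reading of why it transfers to unseen proposals.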

Architecture

Inference-time integration of the system components.

Evaluation Highlights

Achieves 49.4 EPDMS on the Navsim v2 Challenge (Navhard split), winning the challenge.
Approaches the performance of PDM-Closed, a privileged planner using ground-truth data, despite relying on sub-optimal synthetic sensor inputs.
Zero-shot generalization: The scorer trained only on static trajectories outperforms a random-selection baseline on dynamic proposals by 11.1 EPDMS (36.7 vs. 25.6 on Navhard).

Breakthrough Assessment

8/10

A significant practical advance that won a major challenge. It decouples training (breadth via a static vocabulary) from inference (precision via dynamic generation), addressing a key generalization bottleneck.

⚙️ Technical Details

Problem Definition

Setting: End-to-end trajectory scoring where the model must select the best trajectory from a candidate set given raw sensor inputs.

Inputs: Multi-view camera images (front, front-left, front-right).

Outputs: A score for each trajectory in a combined set of static vocabulary and dynamically generated proposals.

Pipeline Flow

Image Backbone (extracts features)
Trajectory Generator (Diffusion Policy creates dynamic proposals)
Trajectory Scorer (evaluates combined static + dynamic set)
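
The three stages above can be sketched as one selection function. This is an illustrative outline only: `select_trajectory` and the `toy_*` stand-ins are hypothetical names, and trajectories are reduced to single numbers so the example runs without any model.

```python
def select_trajectory(images, static_vocab, backbone, generator, scorer):
    """Score the union of static and dynamic candidates and return the best one."""
    features = backbone(images)                      # 1. image backbone
    dynamic = generator(features)                    # 2. dynamic proposals
    candidates = list(static_vocab) + list(dynamic)  # combined candidate set
    scores = scorer(features, candidates)            # 3. one score per candidate
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Hypothetical stand-ins (not the paper's models) so the sketch runs end to end.
def toy_backbone(images):
    return sum(images) / len(images)

def toy_generator(features):
    return [features + 0.25]  # one fine-grained proposal near the scene optimum

def toy_scorer(features, candidates):
    return [-abs(c - features) for c in candidates]

best, score = select_trajectory(
    [1.0, 2.0, 3.0], [0.0, 1.0, 4.0], toy_backbone, toy_generator, toy_scorer
)
print(best, score)  # 2.25 -0.25
```

In this toy setting the dynamic proposal wins because no static entry lies close to the optimum, mirroring the motivation for combining both candidate sources.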

System Modules

Image Backbone

Extract visual features from multi-view images

Model or implementation: ViT-L or EVA-ViT-L

Diffusion Trajectory Generator (DP)

Generate diverse, fine-grained trajectory proposals conditioned on BEV features

Model or implementation: Diffusion Transformer with BEV Encoder

Generalized Vocabulary Scorer (GTRS-Dense) (Scoring)

Score a large set of trajectories (static + dynamic) to select the best one

Model or implementation: Transformer Decoder

Refinement Module (Scoring)

Refine scores for top-k candidates to distinguish subtle differences

Model or implementation: Transformer Decoder
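
A minimal sketch of top-k refinement follows. It assumes that only the k best coarse candidates are re-scored and that the final choice is made among them; `refine_top_k`, `refine_fn`, and the correction values are hypothetical placeholders for the refinement decoder.

```python
def refine_top_k(coarse_scores, refine_fn, k):
    """Re-score only the k highest-scoring candidates; others keep coarse scores."""
    order = sorted(range(len(coarse_scores)),
                   key=lambda i: coarse_scores[i], reverse=True)
    top_k = order[:k]
    refined = list(coarse_scores)
    for i in top_k:
        refined[i] = refine_fn(i, coarse_scores[i])
    # Assumption: the final trajectory is chosen among the refined candidates.
    best = max(top_k, key=lambda i: refined[i])
    return best, refined

# Hypothetical refinement: small corrections that separate two close candidates.
corrections = {1: -0.05, 2: 0.03}
coarse = [0.2, 0.9, 0.88, 0.5]
best, refined = refine_top_k(coarse, lambda i, s: s + corrections.get(i, 0.0), k=2)
print(best)  # 2
```

Candidates 1 and 2 are nearly tied after coarse scoring; the refinement step flips their order, which is the kind of subtle distinction the module is described as targeting.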

Novel Architectural Elements

Decoupled inference architecture: The scorer is trained on a static dense vocabulary but, at inference, scores the union of static and dynamic proposals.
Refinement decoder trained with self-distillation (EMA teacher) to sharpen discrimination between similar trajectories.
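
An EMA teacher update can be sketched as follows. This is the generic exponential-moving-average rule, not code from the paper; weights are plain lists of floats, and the decay of 0.5 is chosen only to make the arithmetic easy to follow (in practice EMA decays are typically close to 1).

```python
def ema_update(teacher, student, decay):
    """Move each teacher weight toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # [0.5, 1.0]
```

Because the teacher changes more slowly than the student, its predictions can serve as comparatively stable soft targets during distillation.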

Modeling

Base Model: ViT-L / EVA-ViT-L (Backbones)

Training Method: Supervised learning with specialized regularization and distillation

Objective Functions:

Purpose: Train the scorer to rank trajectories correctly.

Formally: Standard scoring loss (likely classification or regression against ground-truth scores; the exact loss form is not detailed in the text).

Purpose: Refine scores using soft targets from a teacher.

Formally: Self-distillation where target is a clipped interpolation between ground truth and teacher prediction: y_tilde = clip(y_hat + s_teacher, y_hat - delta, y_hat + delta).
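
The clipping formula above can be evaluated directly. The sketch below implements it exactly as written in this summary; reading y_hat as the ground-truth score is an assumption, the summary does not say whether s_teacher is a raw score or a signed correction, and the numeric values are illustrative only.

```python
def clipped_target(y_hat, s_teacher, delta):
    """Compute clip(y_hat + s_teacher, y_hat - delta, y_hat + delta)."""
    raw = y_hat + s_teacher
    return min(max(raw, y_hat - delta), y_hat + delta)

# A large teacher term is clipped to the upper bound y_hat + delta.
print(clipped_target(0.5, 0.25, 0.125))     # 0.625
# A small teacher term passes through unchanged.
print(clipped_target(0.5, -0.0625, 0.125))  # 0.4375
```

Whatever the teacher outputs, the resulting target stays within delta of y_hat, so the teacher can adjust the target but not override it.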

Training Data:

Navsim v2 dataset (Navtrain split)
Super-dense vocabulary of 16,384 trajectories for GTRS-Dense training

Key Hyperparameters:

learning_rate: 2e-4
weight_decay: 0.0
batch_size: 528
epochs: 20 (Scorer), 50 (Generator)
denoising_steps: 100
vocab_size_training: 16,384
vocab_size_inference_static: 8,192

Compute: 24 NVIDIA A100 GPUs
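
For reference, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the list above. The per-GPU batch size is a derived figure that assumes 528 is the global batch size across all 24 GPUs, which the summary does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GTRSTrainingConfig:
    # Values as reported in this summary; field names are illustrative.
    learning_rate: float = 2e-4
    weight_decay: float = 0.0
    batch_size: int = 528
    scorer_epochs: int = 20
    generator_epochs: int = 50
    denoising_steps: int = 100
    vocab_size_training: int = 16_384
    vocab_size_inference_static: int = 8_192
    num_gpus: int = 24

config = GTRSTrainingConfig()
# Assumption: batch_size is global, so each GPU would process an equal share.
per_gpu_batch = config.batch_size // config.num_gpus
print(per_gpu_batch)  # 22
```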

Comparison to Prior Work

vs. Hydra-MDP: GTRS uses a super-dense vocabulary with dropout and integrates dynamic diffusion proposals at inference.
vs. TransFuser: GTRS is multi-modal and separates generation from scoring.
vs. PDM-Closed: GTRS operates on raw sensors but approaches PDM-Closed's performance through robust generalization strategies.

Limitations

Relies on heavy compute for training (24 A100s).
Inference complexity is higher due to diffusion generation step (100 steps).
Performance on synthetic data is still impacted by artifacts compared to real-world data.

Reproducibility

Code: https://github.com/NVlabs/GTRS

The authors state that code will be released at the URL above. Experiments use the Navsim dataset, and training requires substantial compute (24 A100 GPUs).

📊 Experiments & Results

Evaluation Setup

Open-loop planning evaluation on the Navsim benchmark.

Benchmarks:

Navsim v2 (Navhard split) (Autonomous driving planning)

Metrics:

EPDMS (Extended PDM Score)
Statistical methodology: Not explicitly reported in the paper

Key Results

Ablation studies demonstrating the effectiveness of vocabulary generalization strategies.

Benchmark	Metric	Baseline	This Paper	Δ
Navhard	EPDMS	25.6	36.7	+11.1
Navhard	EPDMS	Not reported in the paper	43.4	Not reported in the paper

Main benchmark results comparing GTRS variants to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Navhard	EPDMS	40.6	43.4	+2.8
Navhard	EPDMS	Not reported in the paper	45.3	Not reported in the paper
Navhard	EPDMS	Not reported in the paper	49.4	Not reported in the paper
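
The Δ column can be checked from the two rows that report a baseline. The snippet below recomputes the differences; the row labels are shorthand introduced here, and rounding to one decimal place matches the precision of the reported scores.

```python
# (label, baseline EPDMS, this paper EPDMS) for rows with a reported baseline.
rows = [
    ("ablation", 25.6, 36.7),
    ("main", 40.6, 43.4),
]

deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]
print(deltas)  # [11.1, 2.8]
```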

Main Takeaways

Training on a super-dense static vocabulary with dropout enables robust scoring of unseen dynamic trajectories.
Combining static and dynamic proposals at inference yields better performance than either alone.
Sensor augmentation (rotation) and refinement training significantly improve robustness to out-of-domain data.
Ensembling multiple GTRS variants allows sensor-based planning to rival privileged methods using ground-truth data.
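
One simple way to ensemble scorers is to average their per-candidate scores before selecting. The sketch below assumes exactly that; the summary does not specify how the GTRS variants are combined, and `ensemble_select` and the toy scores are hypothetical.

```python
def ensemble_select(score_lists, weights=None):
    """Combine per-candidate scores from several models by weighted average."""
    n_models = len(score_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_candidates = len(score_lists[0])
    combined = [
        sum(w * scores[i] for w, scores in zip(weights, score_lists))
        for i in range(n_candidates)
    ]
    best = max(range(n_candidates), key=lambda i: combined[i])
    return best, combined

# Two hypothetical scorers that disagree on the best of three candidates.
scores_a = [0.5, 0.75, 0.25]
scores_b = [0.5, 0.25, 1.0]
best, combined = ensemble_select([scores_a, scores_b])
print(best, combined)  # 2 [0.5, 0.5, 0.625]
```

Scorer A alone would pick candidate 1 and scorer B candidate 2; averaging resolves the disagreement in favor of the candidate with the highest combined score.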

📚 Prerequisite Knowledge

Prerequisites

End-to-end autonomous driving architectures
Diffusion models for trajectory generation
Transformer-based encoders/decoders

Key Terms

EPDMS: Extended PDM Score—a metric for evaluating driving performance that aggregates multiple rule-based safety and comfort metrics

PDM-Closed: A privileged planner baseline that uses ground-truth perception data rather than raw sensor inputs, representing an upper bound for sensor-based methods

BEV: Bird's Eye View—a top-down representation of the driving scene, often constructed from multiple camera views

Diffusion Policy: A generative model approach that creates trajectories by iteratively denoising random noise conditioned on scene features

DDPM: Denoising Diffusion Probabilistic Models—a specific class of generative models used here for trajectory generation

EMA: Exponential Moving Average—a technique where model weights are updated as a moving average of past weights, often used to create stable teacher models

3DGS: 3D Gaussian Splatting—a rendering technique used to generate synthetic sensor data for the benchmark