VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

📝 Paper Summary

End-to-end Autonomous Driving, Probabilistic Planning, Vectorized Scene Representation

VADv2 replaces deterministic trajectory regression with probabilistic planning by modeling the action space as a distribution over a large vocabulary of feasible trajectories, selecting actions via sampling.

Core Problem

Deterministic planning models assume a fixed relationship between environment and action, failing to capture the multi-modal, non-convex nature of human driving behavior.

Why it matters:

Human driving is inherently stochastic; identical scenarios can yield valid but distinct maneuvers (e.g., yielding vs. overtaking), which deterministic regression averages into unsafe 'in-between' actions.
Deterministic models tend to collapse to the dominant mode (e.g., just going straight) seen in training data, ignoring rarer but necessary maneuvers.
Regression-based planning struggles with non-convex solution spaces, often outputting invalid trajectories that violate physical or safety constraints.

Concrete Example: When interacting with an oncoming vehicle, a driver might yield or overtake. A deterministic model might average these valid options and output a collision course. VADv2 models the distribution, allowing it to sample one valid mode (yield or overtake) rather than an invalid average.
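The averaging failure can be illustrated with a toy one-dimensional case. This is only a sketch: the end positions, the conflict zone, and the 50/50 mode probabilities are made-up numbers, not values from the paper.

```python
import numpy as np

# Hypothetical final longitudinal positions (metres) for two valid expert modes.
yield_end = 5.0       # stop short of the conflict zone
overtake_end = 15.0   # clear the conflict zone
conflict_zone = (8.0, 12.0)  # assumed region occupied by the other vehicle

def in_conflict(s, zone=conflict_zone):
    """True if position s lies inside the conflict zone."""
    return zone[0] <= s <= zone[1]

# A regressor trained with a squared-error loss on both modes is pulled
# toward their mean, which here lands inside the conflict zone.
regressed_end = float(np.mean([yield_end, overtake_end]))

# A probabilistic planner keeps both modes and samples one of them.
rng = np.random.default_rng(0)
sampled_end = float(rng.choice([yield_end, overtake_end], p=[0.5, 0.5]))

print("regressed:", regressed_end, "conflict:", in_conflict(regressed_end))
print("sampled:  ", sampled_end, "conflict:", in_conflict(sampled_end))
```

Whichever mode is sampled, the result is one of the two valid maneuvers, while the regressed mean is neither.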

Key Novelty

Probabilistic Planning with Vectorized Vocabulary

Discretizes the continuous planning space into a large 'vocabulary' of 4,096 physically feasible trajectories sampled from expert demonstrations.
Models planning as a probabilistic field: given environmental tokens, the network predicts a probability distribution over this entire trajectory vocabulary.
Selects actions by sampling from the predicted distribution, allowing the system to handle multi-modal scenarios and non-deterministic human behaviors.
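The three steps above can be sketched as follows. The vocabulary and scores are random placeholders, the waypoint count `T` is an assumption, and the softmax normalization is an illustrative choice; the summary does not state how the paper normalizes its scores.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4096   # vocabulary size quoted in the summary
T = 6      # waypoints per trajectory (assumed for illustration)

# Placeholder vocabulary: in the paper these come from expert demonstrations.
vocabulary = rng.normal(size=(N, T, 2))
# Stand-in for the planner's per-trajectory scores given the scene.
logits = rng.normal(size=N)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)          # distribution over the whole vocabulary
idx = int(rng.choice(N, p=probs))  # sample one vocabulary index
action = vocabulary[idx]         # the selected trajectory, shape (T, 2)
```

Replacing `rng.choice(N, p=probs)` with `np.argmax(probs)` would give a greedy variant that always picks the most likely trajectory.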

Architecture

The overall framework of VADv2, detailing the flow from multi-view images to probabilistic action sampling.

Evaluation Highlights

Achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmarks: Driving Score 85.2 vs. 76.1 for the baseline on Town05 Long, and 94.8 vs. 90.0 on Town05 Short.
Runs stably in a fully end-to-end manner using only camera sensors, even without rule-based wrappers.
Demonstrates the ability to handle complex scenarios such as lane changes and interactions, with lower collision rates than deterministic baselines.

Breakthrough Assessment

8/10

Significant shift from deterministic regression to probabilistic vocabulary-based planning in end-to-end driving. Addresses the 'mode averaging' safety issue inherent in regression and reports state-of-the-art results on CARLA Town05.

⚙️ Technical Details

Problem Definition

Setting: End-to-End Autonomous Driving (Sensor-to-Control)

Inputs: Multi-view image sequences from surround cameras

Outputs: Control signals (steer, throttle, brake) derived from a sampled trajectory

Pipeline Flow

Scene Encoder: Images → Environmental Tokens (Map, Agent, Traffic Element, Image)
Navigation Encoding: Nav command + Ego State → Embeddings
Probabilistic Planner: Tokens + Embeddings → Action Distribution
Controller: Sampled Action → PID Control → Steering/Throttle/Brake
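The final Controller stage can be sketched with a generic PID loop. Everything here is an assumption made for illustration: the gains, the use of the first waypoint as the tracking target, the ego-frame axes (x forward, y left), and the sign convention (positive steer means left). The summary only states that a PID controller converts the sampled trajectory into steering, throttle, and brake.

```python
import math

class PID:
    """Textbook PID controller with internal integral and derivative state."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def trajectory_to_control(waypoints, speed, dt, steer_pid, speed_pid):
    """waypoints: list of (x, y) in the ego frame, spaced dt seconds apart."""
    x, y = waypoints[0]
    # Lateral control: steer toward the first waypoint.
    heading_error = math.atan2(y, x)
    steer = clamp(steer_pid.step(heading_error, dt), -1.0, 1.0)
    # Longitudinal control: track the speed implied by the first waypoint.
    target_speed = math.hypot(x, y) / dt
    accel = speed_pid.step(target_speed - speed, dt)
    throttle = clamp(accel, 0.0, 1.0)
    brake = clamp(-accel, 0.0, 1.0)
    return steer, throttle, brake

# Waypoint 2 m straight ahead, vehicle at rest: accelerate, no steering.
steer, throttle, brake = trajectory_to_control(
    [(2.0, 0.0)], speed=0.0, dt=0.5,
    steer_pid=PID(1.0, 0.0, 0.1), speed_pid=PID(0.5, 0.0, 0.0))
```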

System Modules

Scene Encoder

Transform raw multi-view images into high-level instance tokens

Model or implementation: Based on VAD/MapTR architectures

Action Encoder (Planning)

Encode candidate trajectories into high-dimensional embeddings

Model or implementation: MLP over a positional encoding of the trajectory waypoints (the encoding function is denoted Γ; it is not the mathematical gamma function)
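A minimal sketch of such an encoder input, reading Γ as a sinusoidal positional encoding applied to each waypoint coordinate. The number of frequencies, the base of 10000, and the waypoint count are assumptions; the exact frequency schedule in the paper may differ, and the MLP that follows the encoding is omitted.

```python
import numpy as np

def gamma(pos, num_freqs=4, base=10000.0):
    """Sinusoidal encoding: array of shape (...) -> (..., 2 * num_freqs)."""
    pos = np.asarray(pos, dtype=np.float64)[..., None]
    j = np.arange(num_freqs)
    freqs = 1.0 / base ** (j / num_freqs)   # geometrically spaced frequencies
    angles = pos * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

def encode_trajectory(traj, num_freqs=4):
    """traj: (T, 2) waypoints -> flat vector of length T * 2 * 2 * num_freqs."""
    return gamma(traj, num_freqs).reshape(-1)

traj = np.array([[0.0, 1.0], [0.1, 2.0], [0.3, 3.0],
                 [0.6, 4.0], [1.0, 5.0], [1.5, 6.0]])  # made-up waypoints
embedding_input = encode_trajectory(traj)   # would be fed to the MLP
```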

Transformer Decoder (Planning)

Interaction between action embeddings and environmental context

Model or implementation: Cascaded Transformer Decoder
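The interaction step can be sketched as a single cross-attention pass in which action embeddings act as queries over the environmental tokens. This is a simplified illustration: it omits learned projections, multiple heads, layer normalization, and the cascading across token types, and the token count, embedding width, and linear scoring head are placeholders.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: (N, d), keys and values: (M, d) -> output (N, d), weights (N, M).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
N, M, d = 4096, 50, 32                  # M and d are assumed sizes
action_emb = rng.normal(size=(N, d))    # one embedding per vocabulary trajectory
env_tokens = rng.normal(size=(M, d))    # map, agent, traffic-element, image tokens
w_score = rng.normal(size=d)            # placeholder scoring head

attended, weights = cross_attention(action_emb, env_tokens, env_tokens)
logits = (action_emb + attended) @ w_score   # one score per candidate trajectory
```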

Novel Architectural Elements

Probabilistic Field Planner: Instead of a regression head, VADv2 uses a field function to score a large, fixed vocabulary of trajectories based on scene context.
Discretized Action Space via Vocabulary: Pre-computing 4,096 feasible trajectories to convert the continuous planning problem into a discrete classification/ranking problem.

Modeling

Base Model: Custom Transformer-based architecture (building on VAD, MapTR)

Training Method: Supervised Learning with distribution matching

Objective Functions:

Purpose: Match predicted trajectory distribution to expert behavior.
Formally: KL Divergence loss between predicted distribution and data distribution.

Purpose: Penalize unsafe trajectories using prior knowledge.
Formally: Conflict Loss (assigning negative weights/penalties to vocabulary items that collide with agents or boundaries).

Purpose: Ensure intermediate representations capture scene structure.
Formally: Scene Token Loss (L1 and Focal loss for map, agent, and traffic element tasks).
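The first two objectives can be sketched over a small vocabulary. The KL term is the standard definition; the conflict term shown here (probability mass placed on flagged trajectories) is only one plausible form, since the summary does not give the paper's exact formulation. The `conflict_mask` and the eight-item vocabulary are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kl_divergence(p_data, logits, eps=1e-12):
    """KL(p_data || p_pred), where p_pred = softmax(logits)."""
    log_q = log_softmax(logits)
    return float(np.sum(p_data * (np.log(p_data + eps) - log_q)))

def conflict_penalty(logits, conflict_mask):
    """Assumed form: total predicted probability on conflicting trajectories."""
    probs = np.exp(log_softmax(logits))
    return float(probs[conflict_mask].sum())

n = 8
logits = np.zeros(n)                    # uniform prediction
p_data = np.zeros(n)
p_data[2] = 1.0                         # expert behavior matched to item 2
conflict_mask = np.zeros(n, dtype=bool)
conflict_mask[[5, 6]] = True            # items flagged as colliding

kl = kl_divergence(p_data, logits)                 # equals log(8) here
penalty = conflict_penalty(logits, conflict_mask)  # equals 2/8 here
```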

Key Hyperparameters:

planning_vocabulary_size: 4096 (default)
vocabulary_sampling_method: Furthest Trajectory Sampling
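Furthest trajectory sampling can be sketched as the trajectory analogue of greedy farthest point sampling. The Euclidean distance over flattened waypoints, the starting index, and the random demonstration pool are assumptions; the summary does not specify the distance measure used in the paper.

```python
import numpy as np

def furthest_trajectory_sampling(trajs, k, start=0):
    """Greedily pick k indices from trajs (M, T, 2) that are mutually far apart."""
    flat = trajs.reshape(len(trajs), -1)
    chosen = [start]
    # Distance from every trajectory to its nearest already-chosen trajectory.
    min_dist = np.linalg.norm(flat - flat[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))   # the trajectory furthest from the chosen set
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(flat - flat[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 6, 2))       # placeholder for expert trajectories
idx = furthest_trajectory_sampling(pool, k=16)
vocabulary = pool[idx]                    # shape (16, 6, 2)
```

With the paper's setting, `k` would be 4096 and the pool would be the set of expert demonstrations.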

Compute: Not reported in the paper

Comparison to Prior Work

vs. VAD/UniAD: Uses probabilistic planning over a discretized vocabulary instead of deterministic trajectory regression.
vs. DriveGPT4: Models trajectory distribution directly rather than using an LLM to predict tokenized actions or text.
vs. ST-P3: Uses vectorized scene representations and probabilistic outputs rather than raster-based perception and regression.

Limitations

Computational cost of evaluating 4,096 trajectory candidates in the probabilistic field is not analyzed.
Reliance on a fixed vocabulary might limit generalization to extreme edge cases not covered by the sampled trajectories.
No specific hardware latency or FPS metrics reported for the inference pipeline.

Reproducibility

Code: https://hgao-cv.github.io/VADv2

The link above is the project page, which mainly hosts demos. The summary describes the code as 'publicly available', but the text primarily references this project page rather than a source repository. Exact training hardware and time are not reported.

📊 Experiments & Results

Evaluation Setup

Closed-loop simulation

Benchmarks:

CARLA Town05 Long (Closed-loop urban driving)
CARLA Town05 Short (Closed-loop urban driving)

Metrics:

Driving Score (DS)
Route Completion (RC)
Infraction Score (IS)
Collision Rate
Statistical methodology: Not explicitly reported in the paper
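For orientation, the relationship between the three headline metrics can be sketched as below. This assumes the CARLA-leaderboard-style definition, in which the Driving Score is the per-route product of Route Completion and Infraction Score, averaged over routes; the summary itself does not define the metrics, and the route values are made up.

```python
import numpy as np

def driving_score(route_completion, infraction_score):
    """Mean over routes of RC_i (percent) times IS_i (in [0, 1])."""
    rc = np.asarray(route_completion, dtype=float)
    inf = np.asarray(infraction_score, dtype=float)
    return float(np.mean(rc * inf))

rc = [100.0, 80.0]   # hypothetical per-route completion, in percent
inf = [1.0, 0.5]     # hypothetical per-route infraction scores

ds = driving_score(rc, inf)                       # (100*1.0 + 80*0.5) / 2
naive = float(np.mean(rc)) * float(np.mean(inf))  # product of the averages
```

Because the product is taken per route before averaging, the reported DS generally differs from the product of the averaged RC and IS columns.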

Key Results

VADv2 achieves state-of-the-art performance on the CARLA Town05 Long benchmark, significantly surpassing previous methods in Driving Score and safety metrics. VADv2 also outperforms baselines on the Town05 Short benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
CARLA Town05 Long	Driving Score (DS)	76.1	85.2	+9.1
CARLA Town05 Long	Route Completion (RC)	93.0	98.5	+5.5
CARLA Town05 Long	Infraction Score (IS)	0.82	0.88	+0.06
CARLA Town05 Short	Driving Score (DS)	90.0	94.8	+4.8

Experiment Figures

Comparison of Deterministic vs. Probabilistic Planning in multi-modal scenarios (Overtaking and Interaction).

Main Takeaways

Probabilistic planning appears to reduce collisions relative to deterministic regression methods. No collision-rate figures are listed in the results, so this is inferred from the higher Infraction Score and Driving Score.
The method is stable enough to run end-to-end without rule-based safety wrappers, although wrappers can be added for robustness.
The use of a large planning vocabulary (N=4096) allows for covering diverse driving modes (yield vs. overtake) that regression methods average out.

📚 Prerequisite Knowledge

Prerequisites

End-to-end autonomous driving architectures
Transformer-based perception (BEVFormer, MapTR)
Probabilistic modeling / Distribution learning

Key Terms

Vectorized Representation: Encoding scene elements (lanes, agents) as sparse sets of points or vectors rather than dense raster images.

Probabilistic Field: A function mapping a continuous space (here, trajectory coordinates) to a probability density, similar to how NeRF maps coordinates to radiance.

Planning Vocabulary: A discrete set of N representative trajectories sampled from expert demonstrations, used to approximate the continuous action space.

Furthest Trajectory Sampling: A sampling method used to select diverse trajectories from a dataset to form the planning vocabulary, ensuring coverage of the action space.

KL Divergence: An asymmetric measure of how one probability distribution differs from another (not a true distance metric), minimized here to bring the predicted action distribution close to the ground truth distribution.

MinFDE: Minimum Final Displacement Error, a metric measuring the distance between the best predicted trajectory endpoint and the ground truth endpoint.
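A minimal sketch of the MinFDE computation, assuming K candidate trajectories of T two-dimensional waypoints each and Euclidean endpoint distance; the example trajectories are made up.

```python
import numpy as np

def min_fde(pred_trajs, gt_traj):
    """pred_trajs: (K, T, 2), gt_traj: (T, 2) -> smallest endpoint distance."""
    end_dists = np.linalg.norm(pred_trajs[:, -1, :] - gt_traj[-1], axis=-1)
    return float(end_dists.min())

gt = np.array([[0.0, 0.0], [0.0, 0.0]])
preds = np.array([
    [[0.0, 0.0], [3.0, 4.0]],   # endpoint 5.0 away from the ground truth
    [[0.0, 0.0], [0.0, 1.0]],   # endpoint 1.0 away from the ground truth
])
score = min_fde(preds, gt)
```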