Solving Motion Planning Tasks with a Scalable Generative Model

📝 Paper Summary

Autonomous Driving Simulation, World Models

GUMP is a generative world model that uses a key-value tokenization strategy and partial autoregressive decoding to unify scene generation, simulation, and planning in autonomous driving.

Core Problem

Autonomous driving systems lack scalable, realistic simulators; existing learning-based models struggle with long-horizon consistency in closed-loop settings, while rule-based simulators fail to capture complex human-like interactivity.

Why it matters:

Scalability limits: AD systems struggle to adapt to unseen environments without extensive engineering for failure scenarios.
Simulation gap: Open-loop predictions cannot adapt to out-of-distribution states encountered during real-world interactions.
Safety validation: Developing safe policies requires high-fidelity, reactive environments that can generate diverse and rare traffic scenarios.

Concrete Example: In a complex intersection, an open-loop model might predict a car continues straight regardless of the ego-vehicle's actions. In a closed-loop scenario, if the ego-vehicle aggressively merges, a realistic simulator (like GUMP) should make the other car yield or swerve, rather than colliding blindly as an open-loop prediction would.

Key Novelty

Generative Unified Model for Motion Planning (GUMP)

Key-Value Tokenizer: Treats agents as 'keys' (ID + category) and their physical properties as 'values' (quantized states), enabling flexible querying and dynamic agent management.
Partial-Autoregressive Acceleration: Converts intra-frame dependencies to non-autoregressive (NAR) parallel decoding to speed up inference without losing inter-frame causal consistency.
Unified Downstream Support: A single foundation model acts as a scene generator, a reactive simulator for testing, a planner, and an RL training environment.
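
A rough way to see why intra-frame parallel decoding helps is to count sequential decoder passes. The sketch below assumes a hypothetical layout of frames × agents × tokens per agent. The function names and the block-causal mask are illustrative only and are not taken from the GUMP code. The mask shows one plausible reading of "causal across frames, parallel within a frame".

```python
def sequential_steps(num_frames, agents_per_frame, tokens_per_agent, partial_ar):
    """Count decoder passes needed to roll out a scene.

    Full AR: one pass per token.
    Partial AR (as assumed here): one pass per frame, with every token
    inside that frame emitted in parallel.
    """
    if partial_ar:
        return num_frames
    return num_frames * agents_per_frame * tokens_per_agent


def block_causal_mask(frame_ids):
    """mask[i][j] is True when token i may attend to token j.

    A token sees every token in its own frame or an earlier frame, so
    causality holds across frames but is not enforced inside a frame.
    """
    return [[fj <= fi for fj in frame_ids] for fi in frame_ids]


full = sequential_steps(num_frames=10, agents_per_frame=8, tokens_per_agent=4, partial_ar=False)
fast = sequential_steps(num_frames=10, agents_per_frame=8, tokens_per_agent=4, partial_ar=True)
mask = block_causal_mask([0, 0, 1, 1])  # two tokens in frame 0, two in frame 1
```

Under these assumed sizes the number of sequential passes drops by a factor of agents × tokens per agent. The actual speedup depends on the implementation.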

Architecture

The overall architecture of GUMP, detailing the flow from static/dynamic inputs to the final trajectory decoding.

Evaluation Highlights

Achieves state-of-the-art performance on simulation realism and scene generation benchmarks (Waymo and nuPlan).
The planner built on the world model outperforms prior work on planning benchmarks.
Significantly improves inference and training speed via the partial-autoregressive mode while maintaining generative capability.

Breakthrough Assessment

8/10

Proposes a highly versatile architecture that effectively merges simulation, generation, and planning. The key-value tokenization and partial-AR speedup address critical bottlenecks in deploying transformers for real-time AD simulation.

⚙️ Technical Details

Problem Definition

Setting: Closed-loop trajectory prediction and generation in multi-agent traffic environments.

Inputs: Context c (static map, language prompts) and historical dynamic states s_t of agents.

Outputs: Future sequences of agent states (position, heading, velocity, size) generated autoregressively.
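
The difference between open-loop and closed-loop generation can be illustrated with a toy rollout. In the sketch below, `toy_agent_step` is a hand-written stand-in for one step of a learned model (it is not GUMP), and the braking rule and `safe_gap` value are assumptions made purely for illustration.

```python
def toy_agent_step(agent_pos, agent_speed, ego_pos, safe_gap=5.0):
    """Stand-in for one model step: the agent brakes when the ego is close ahead."""
    gap = ego_pos - agent_pos
    if 0.0 <= gap < safe_gap:
        agent_speed = max(agent_speed - 1.0, 0.0)
    return agent_pos + agent_speed, agent_speed


def closed_loop_rollout(ego_plan, agent_pos, agent_speed):
    """Feed each generated state back in, together with the ego's actual position."""
    history = []
    for ego_pos in ego_plan:
        agent_pos, agent_speed = toy_agent_step(agent_pos, agent_speed, ego_pos)
        history.append(agent_pos)
    return history


def open_loop_rollout(num_steps, agent_pos, agent_speed):
    """Predict once from the initial state and ignore what the ego does afterwards."""
    return [agent_pos + agent_speed * (t + 1) for t in range(num_steps)]


ego_plan = [4.0, 5.0, 6.0, 7.0]  # ego merges in just ahead of the agent
closed = closed_loop_rollout(ego_plan, agent_pos=0.0, agent_speed=2.0)
opened = open_loop_rollout(len(ego_plan), agent_pos=0.0, agent_speed=2.0)
```

In this toy setup the closed-loop agent slows down and stays behind the ego, while the open-loop prediction drives straight through the ego's position.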

Pipeline Flow

Input Processing (Static Map Encoding + Dynamic State Tokenization)
Fusion & Modeling (Multimodal Causal Transformer)
Decoding (GRU-based State Recovery)

System Modules

Static Raster Autoencoder (Input Processing)

Encodes static information (maps, routes, static obstacles) into latent features.

Model or implementation: 2D Convolutional Encoder

Dynamic Tokenizer (Input Processing)

Converts agent data into discrete tokens using a Key-Value pair strategy.

Model or implementation: Quantization logic
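
A minimal sketch of the key-value idea follows, assuming uniform binning. The field names, value ranges, and bin counts are hypothetical and chosen only for illustration. The summary does not specify GUMP's actual quantization scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentKey:
    """The 'key': which agent a group of state tokens belongs to."""
    agent_id: int
    category: str  # e.g. "vehicle" or "pedestrian"


def quantize(value, low, high, num_bins):
    """Map a continuous value to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * num_bins), num_bins - 1)


# Hypothetical (low, high, num_bins) per state field.
STATE_SPEC = {
    "x": (-100.0, 100.0, 200),
    "y": (-100.0, 100.0, 200),
    "heading": (-3.1416, 3.1416, 64),
    "speed": (0.0, 40.0, 80),
}


def tokenize_agent(key, state):
    """Return (key, value), where value maps each field to its quantized token."""
    value = {name: quantize(state[name], *STATE_SPEC[name]) for name in STATE_SPEC}
    return key, value


scene = {}
key, value = tokenize_agent(
    AgentKey(7, "vehicle"),
    {"x": 12.3, "y": -4.0, "heading": 0.0, "speed": 9.5},
)
scene[key] = value                      # agent enters the scene
tokens = scene[AgentKey(7, "vehicle")]  # query one agent by its key
del scene[key]                          # agent leaves the scene
```

Storing states under explicit keys is what makes per-agent querying and adding or removing agents straightforward in this sketch.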

Multimodal Causal Transformer (MCT)

Core generative model that learns interaction dynamics and predicts future tokens.

Model or implementation: GPT-2 architecture with Gated Cross Attention (GCA)
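
To make the gating idea concrete, here is a small sketch of one common form of gated cross attention: a residual update scaled by `tanh` of a learnable scalar. This gate form is an assumption; the summary does not state which gate GUMP uses. Learned query/key/value projections are omitted, and all names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, context):
    """Single-head dot-product attention of one query vector over context vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    return [sum(w * ctx[d] for w, ctx in zip(weights, context)) for d in range(len(query))]


def gated_cross_attention(agent_token, map_features, gate_param):
    """Residual update: agent_token + tanh(gate_param) * attention(agent_token, map_features)."""
    attended = cross_attend(agent_token, map_features)
    gate = math.tanh(gate_param)
    return [x + gate * a for x, a in zip(agent_token, attended)]


agent = [1.0, 0.0]
map_feats = [[0.5, 0.5], [2.0, -1.0]]
closed_gate = gated_cross_attention(agent, map_feats, gate_param=0.0)  # gate shut
open_gate = gated_cross_attention(agent, map_feats, gate_param=1.0)    # gate open
```

With the gate parameter at zero the agent token passes through unchanged, so the map modality can be blended in gradually as the gate is learned.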

Auto-regressive Decoder

Decodes latent features back into continuous state trajectories.

Model or implementation: Stacked GRU layers

Novel Architectural Elements

Key-Value paired tokenization (Control Token as Key, State Token as Value) enabling specific object querying.
Intra-frame Non-Autoregressive (NAR) conversion module that parallelizes decoding within a single time step to speed up simulation.
Prediction chunking mechanism with temporal aggregation (weighted average with decay) to stabilize AR rollouts.
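
The chunking-plus-aggregation idea can be sketched as follows. The model is assumed to emit a short chunk of future values at every step, so several chunks overlap at any given timestep and are blended with decaying weights. The weighting direction (newer predictions weigh more) and the scalar state are assumptions for illustration, not details confirmed by the summary.

```python
def aggregate_chunks(chunks, target_step, decay):
    """Blend all chunk predictions that cover target_step.

    chunks: list of (start_step, [pred_0, pred_1, ...]), where pred_k is the
            value predicted for step start_step + k.
    A prediction made `age` steps before target_step gets weight decay**age,
    so with 0 < decay < 1 the most recent chunk dominates (assumed choice).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for start_step, preds in chunks:
        age = target_step - start_step
        if 0 <= age < len(preds):
            w = decay ** age
            weighted_sum += w * preds[age]
            weight_total += w
    if weight_total == 0.0:
        raise ValueError("no chunk covers the requested step")
    return weighted_sum / weight_total


chunks = [
    (0, [1.0, 2.0, 4.0]),  # predicted at step 0 for steps 0..2
    (1, [2.0, 2.0, 3.0]),  # predicted at step 1 for steps 1..3
    (2, [1.0, 3.0, 4.0]),  # predicted at step 2 for steps 2..4
]
blended = aggregate_chunks(chunks, target_step=2, decay=0.5)
```

Averaging overlapping predictions smooths out step-to-step jitter, which is the stabilizing effect the list item describes.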

Modeling

Base Model: GPT-2 style Transformer

Training Method: Reinforcement Learning (for the RL Engine component)

Objective Functions:

Purpose: Optimize policy to maximize accumulated discounted return.

Formally: Maximize sum of gamma^t * r(s_t, a_t).

Purpose: Calculate reward based on safety and progress metrics.

Formally: R = w * Theta (critical metrics) + w * Phi (general metrics).
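
The two formulas above can be worked through numerically. In the sketch below, Theta and Phi are treated as scalar scores, and the two weights are given separate names (`w_critical`, `w_general`). That separation is an assumption, since the summary writes both weights as `w`. All numeric values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite rollout."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def composite_reward(critical, general, w_critical, w_general):
    """R = w_critical * Theta + w_general * Phi, with Theta and Phi as scalar scores."""
    return w_critical * critical + w_general * general


# Three rollout steps with made-up metric scores.
rewards = [
    composite_reward(critical=1.0, general=0.5, w_critical=2.0, w_general=1.0),
    composite_reward(critical=0.0, general=1.0, w_critical=2.0, w_general=1.0),
    composite_reward(critical=1.0, general=1.0, w_critical=2.0, w_general=1.0),
]
total = discounted_return(rewards, gamma=0.5)
```

Here the per-step rewards are 2.5, 1.0, and 3.0, and the discounted return is 2.5 + 0.5 × 1.0 + 0.25 × 3.0 = 3.75.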

Training Data:

Waymo Open Motion Dataset
nuPlan Dataset

Key Hyperparameters:

Computational requirements: Not reported in the paper

Compute: Inference accelerated via partial-AR mode (intra-frame parallelization).

Comparison to Prior Work

vs. IDM: GUMP is data-driven and captures complex interactions beyond simple car-following rules.
vs. Open-loop Forecasting: GUMP enables reactive closed-loop simulation where agents respond to the ego-vehicle.
vs. Standard Transformers: GUMP adds Key-Value tokenization to handle a changing number of agents and partial-AR decoding to reduce inference cost.

Limitations

Dependency on quantization quality for state representation.
RL training stability relies on the realism of the world model (simulation-to-reality gap).
Computational cost of full-AR generation is high (mitigated by partial-AR mode).

Reproducibility

Code: https://github.com/HorizonRobotics/GUMP/

The Waymo and nuPlan datasets the paper relies on are public. Specific model hyperparameters (layer counts, hidden dimensions) are not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Closed-loop simulation and planning evaluation on real-world driving datasets.

Benchmarks:

Waymo Open Motion Dataset (Motion Simulation & Planning)
nuPlan (Motion Planning)

Metrics:

Simulation Realism (Sim Agents metric)
Safety (Collision rates, infractions)
Progress / Driving Direction
Human-likeness
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The model achieves state-of-the-art performance on both simulation realism and scene generation benchmarks compared to baselines.
The planning engine built on GUMP outperforms prior work on planning benchmarks, validating the utility of the generative world model for downstream tasks.
The partial-AR design (Intra-frame NAR) provides significant speedups for training and inference without sacrificing generative quality compared to Full-AR.
The framework effectively serves multiple roles: data generator (via prompts), simulator (reactive environment), and planner (via rollouts).

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (specifically GPT-style)
Reinforcement Learning (MDP formulation)
Autonomous Driving Motion Forecasting

Key Terms

AR: Autoregressive—generating data one step at a time, where each step depends on previous ones.

NAR: Non-Autoregressive—generating multiple data points in parallel to increase speed.

World Model: A learned internal representation of the environment's dynamics, allowing an agent to simulate futures.

Tokenization: Converting continuous data (like vehicle coordinates) into discrete tokens that a language model can process.

GCA: Gated Cross Attention—a mechanism to selectively fuse information from different modalities (e.g., map data and agent tracks).

Rollout: Simulating a sequence of future steps starting from a current state to estimate outcomes.

MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker.

SAC: Soft Actor-Critic—an off-policy reinforcement learning algorithm that optimizes a stochastic policy to maximize expected reward and entropy.