Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving

📝 Paper Summary

End-to-End Autonomous Driving Vision-Language Models (VLMs) for Control

Poutine adapts an off-the-shelf VLM for driving by predicting trajectories as text tokens, using large-scale vision-language-trajectory pre-training followed by lightweight reinforcement learning on human preferences.

Core Problem

Long-tail driving scenarios (e.g., rare accidents, construction) are safety-critical but scarce in training data, requiring high-level reasoning that standard planners and nominal-driving VLMs often lack.

Why it matters:

Rare long-tail events dominate operational risk in autonomous driving but account for less than 0.003% of daily driving data
Current VLM driving agents are mostly tested on nominal benchmarks (nuScenes) where semantic reasoning is less critical
Prior methods often require complex custom architectures (heads/tokenizers) that may hinder the VLM's native reasoning capabilities

Concrete Example: In a 'Spotlight' scenario (manually selected challenging cases), a standard planner might fail to infer the intent of a pedestrian or construction cone placement, whereas Poutine leverages VLM reasoning to generate a safe trajectory.

Key Novelty

Simple VLM-based Driving Policy with GRPO

Treats trajectory prediction purely as a next-token prediction task (text generation) using an unmodified off-the-shelf VLM, avoiding custom action heads
Combines large-scale self-supervised Vision-Language-Trajectory (VLT) pre-training with Group Relative Policy Optimization (GRPO), a lightweight RL method using human preferences

Architecture

The two-stage training pipeline (VLT Pre-training followed by RL Post-training) and the inference data flow

Evaluation Highlights

Achieves 7.99 Rater Feedback Score (RFS) on the Waymo Vision-Based End-to-End Driving Test Set, securing 1st place in the 2025 Challenge
Outperforms the Waymo Baseline by +0.46 RFS points on the test set
Demonstrates zero-shot cross-continent transfer: a model trained solely on Japanese data (CoVLA) achieves 7.74 RFS on US data (Waymo)

Breakthrough Assessment

9/10

Achieves SOTA on a rigorous long-tail benchmark using a surprisingly simple recipe (standard VLM + GRPO), proving complex custom architectures are unnecessary for strong driving performance.

⚙️ Technical Details

Problem Definition

Setting: Open-loop end-to-end trajectory planning

Inputs: Task description, navigation command (intent), past ego trajectory, and sequence of historical multi-view RGB images

Outputs: Predicted future trajectory of the ego vehicle (represented as text tokens)

Pipeline Flow

Input Processing (Images, Intent, History)
VLM Backbone (Qwen2.5-VL)
Text Generation (Reasoning + Trajectory)

System Modules

VLM Backbone

Jointly processes visual and textual inputs to generate future trajectory tokens

Model or implementation: Qwen2.5-VL 3B Instruct

Modeling

Base Model: Qwen2.5-VL 3B Instruct

Training Method: Two-stage: (1) Supervised VLT Pre-training, (2) GRPO Reinforcement Learning

Objective Functions:

Purpose: Pre-training (SFT) to learn base driving capabilities.

Formally: Standard next-token prediction loss over vision, language, and trajectory tokens.
Purpose: RL Post-training to align with human preferences.

Formally: GRPO objective maximizing group-relative advantage A_i based on reward r, with KL divergence penalty.
Purpose: Reward function for RL.

Formally: r = r_drive (L2 error or preference score) + r_format (1 if valid format, 0 otherwise).

Training Data:

Stage 1: 83 hours nominal driving (CoVLA, Japan) + 11 hours long-tail (Waymo, US)
Stage 2: <500 preference-labeled frames from Waymo validation set
Auto-generated captions using Qwen2.5-VL 72B for all training data

Key Hyperparameters:

learning_rate_sft: 1e-5
learning_rate_rl: 1e-6
batch_size_sft: 64 (CoVLA), 16 (Waymo)
+ 5 more
batch_size_rl: 32
beta_kl: 0.04
sampling_temperature: 0.9
rollouts_per_sample: 8
rl_steps: 2000

Compute: Training: 4x NVIDIA A100 GPUs (24h for CoVLA, 10h for Waymo, 12h for RL)

Comparison to Prior Work

vs. AutoVLA: Poutine generates text tokens for trajectory vs. AutoVLA's custom action tokens; Poutine achieves higher RFS
vs. EMMA: Poutine uses GRPO post-training on preferences vs. EMMA's pure SFT approach
vs. TrajHF: Poutine bootstraps from internet-scale VLM + VLT pre-training vs. TrajHF's smaller scale initialization
+ 1 more
vs. Drive-R1 [not cited in paper]: Drive-R1 applies GRPO to nuScenes (nominal), Poutine applies it to Waymo (long-tail)

Limitations

RL fine-tuning improves overall score but degrades performance in specific categories (Spotlight, Construction, Multi-lane)
Inference requires generating text tokens, which may be slower than dedicated regression heads (though precise latency not reported)
Limited context: uses only 3 front-facing cameras due to compute constraints, omitting side/rear views

Reproducibility

No public code or model weights provided in the paper. Dataset (CoVLA) is public; Waymo data requires challenge access. Prompts for annotation and pre-training are provided in Appendix.

📊 Experiments & Results

Evaluation Setup

Open-loop evaluation on curated long-tail scenarios

Benchmarks:

Waymo Vision-Based End-to-End Driving (WOD-E2E) (Long-tail trajectory planning)

Metrics:

Rater Feedback Score (RFS)
Average Displacement Error (ADE)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Poutine dominates the Waymo challenge leaderboard, significantly outperforming baselines and establishing SOTA performance.
WOD-E2E Test Set	RFS	7.53	7.99	+0.46
WOD-E2E Test Set	RFS	7.56	7.99	+0.43
WOD-E2E Validation Set	RFS	7.91	8.12	+0.21
WOD-E2E Validation Set	RFS	7.53	7.74	+0.21

Experiment Figures

Learning curves (RFS vs Steps) for GRPO fine-tuning on validation data

Main Takeaways

Lightweight RL (GRPO) with very few labels (<500) significantly boosts performance over strong supervised baselines
Text-based trajectory representation works better than custom action tokenizers, likely preserving VLM reasoning capabilities
Geographic zero-shot transfer (Japan -> US) is possible with VLM pre-training, despite left- vs. right-hand driving differences

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Reinforcement Learning (RL)
End-to-End Autonomous Driving

Key Terms

VLT: Vision-Language-Trajectory—a multimodal pre-training approach where the model learns to predict text descriptions and trajectories from visual inputs

GRPO: Group Relative Policy Optimization—an RL algorithm that updates a policy by comparing a group of outputs generated for the same input and reinforcing the best ones relative to the group average

RFS: Rater Feedback Score—a metric evaluating driving quality based on alignment with human-preferred trajectories

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

ADE: Average Displacement Error—the average L2 distance between the predicted trajectory and the ground truth over a specific time horizon

nominal driving: Standard, everyday driving conditions (lane keeping, simple turns) as opposed to rare 'long-tail' events

SFT: Supervised Fine-Tuning—training the model on labeled data using standard cross-entropy loss

long-tail scenarios: Rare, edge-case driving situations (e.g., debris on road, erratic pedestrians) that are difficult to model