VLA: Vision-Language-Action model, a multimodal model that takes visual and text inputs and produces physical actions as output
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input, often used to align LLMs
Dr. GRPO: A variant of GRPO designed to mitigate 'difficulty bias' by removing the standard deviation normalization term from the advantage estimation
SFT: Supervised Fine-Tuning—training the model on labeled demonstrations before RL
difficulty bias: A phenomenon where standard RL algorithms prioritize learning from easy (low-variance) samples while ignoring harder (high-variance) samples where the model is unstable
PDM Score: Predictive Driver Model score, a composite metric for NAVSIM evaluating safety, comfort, and progress of predicted trajectories
RFS: Rater Feedback Score, a metric for WaymoE2E measuring the similarity of predicted trajectories to human preference labels
k-disc tokenization: A method to discretize continuous trajectories into a fixed vocabulary of cluster centers (tokens) for language model prediction
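
To make the GRPO and Dr. GRPO entries concrete, the sketch below contrasts the two advantage estimates for one group of rewards sampled from the same input. It is a minimal illustration, not the implementation of any particular codebase: the function names, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for the example.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages with standard deviation normalization.

    Assumption: population standard deviation plus a small eps for
    numerical stability; real implementations may differ on both points.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantages: subtract the group mean, no std division."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# Two hypothetical groups of rewards, each for a single input.
mixed_group = [1.0, 0.0, 1.0, 0.0]      # outcomes vary a lot
uniform_group = [1.0, 1.0, 1.0, 0.0]    # outcomes are mostly the same

for name, group in [("mixed", mixed_group), ("uniform", uniform_group)]:
    print(name, "GRPO:", grpo_advantages(group))
    print(name, "Dr. GRPO:", dr_grpo_advantages(group))
```

Dividing by the group standard deviation rescales every group to roughly unit spread, so a group whose rewards barely differ gets its advantages magnified relative to the mean-centered values. Dropping the division keeps the advantage magnitudes tied to the actual reward differences.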