VLA: Vision-Language-Action—multimodal models that take visual and language inputs to generate robotic actions
SFT: Supervised Fine-Tuning—training a model on a dataset of expert demonstrations (human teleoperation)
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same input, eliminating the need for a critic model
LLM: Large Language Model, a neural network trained on large text corpora to predict and generate text
veRL: Volcano Engine Reinforcement Learning—a library for RL training of LLMs, which this paper extends for VLAs
Process Reward: A reward signal given at intermediate steps (e.g., 'distance to object'), often manually engineered
Outcome Reward: A sparse binary reward given only at the end of a task (Success=1, Failure=0)
Pushcut: A phenomenon where the RL-trained policy discovers novel, efficient manipulation behaviors not present in the SFT training data (for example, pushing an object directly toward its goal as a shortcut instead of grasping and moving it)
Proprioceptive State: Internal sensing of the robot's own body, such as joint angles or end-effector position
Dynamic Sampling: A strategy during RL rollout where groups of sampled trajectories whose rewards are all identical (all success or all failure) are discarded, because their group-normalized advantages are zero and contribute no gradient
Action Chunking: Predicting a sequence of future actions (a chunk) in one forward pass rather than a single step, used to improve temporal consistency
Sim-to-Real: Transferring a policy trained in a physics simulation to a physical robot in the real world
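
The GRPO and Dynamic Sampling entries above describe arithmetic that is easier to see on a small example. The following is a minimal sketch, not the paper's or veRL's implementation: the function names, the `eps` constant, the use of the population standard deviation, and the toy reward lists are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same input.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling filter: keep a group only if its rewards are not all identical."""
    return len(set(rewards)) > 1


# Hypothetical binary outcome rewards (1 = success, 0 = failure) for three inputs.
groups = {
    "task_a": [1, 0, 0, 1],  # mixed outcomes: carries a learning signal
    "task_b": [0, 0, 0, 0],  # all failures: every advantage would be zero
    "task_c": [1, 1, 1, 1],  # all successes: every advantage would be zero
}

# Only groups with mixed outcomes survive the filter and get advantages.
kept = {
    name: group_relative_advantages(rewards)
    for name, rewards in groups.items()
    if keep_group(rewards)
}
```

In this sketch only `task_a` is kept. Its successful rollouts receive positive advantages and its failed rollouts receive negative ones, with no critic model involved, while the uniform groups are dropped because normalizing them would yield all-zero advantages.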