EFM: Embodied Foundation Model, a large pretrained model (such as a vision-language model, or VLM) fine-tuned to output robot actions
SFT: Supervised Fine-Tuning, training a model on labeled examples (here, human demonstrations) before applying reinforcement learning
Steps-to-go: A predicted scalar value estimating the number of timesteps remaining until a goal is achieved from the current state
REINFORCE: A basic policy gradient algorithm in Reinforcement Learning that increases the log-probability of each sampled action in proportion to the return (total reward) of its trajectory
Monte Carlo returns: The actual sum of rewards received from a specific time step until the end of an episode, used to estimate the value of a state-action pair
Behavioral Cloning: A supervised learning approach where a robot learns a policy by strictly mimicking expert (human) demonstrations
PaLI: Pathways Language and Image model, a large vision-language model architecture used as the backbone for the robot policy
RT-2: Robotic Transformer 2, a specific method for turning vision-language models (VLMs) into robot policies by tokenizing actions as text
On-policy: An RL setting where the data used for training comes from the current version of the policy being optimized, rather than historical data
Deadly Triad: The instability caused in RL when combining Function Approximation, Bootstrapping (using estimates to update estimates), and Off-Policy learning