Offline RL: Reinforcement learning where the agent learns from a fixed, previously collected dataset without interacting with the environment during training
BC: Behavioral Cloning, a supervised learning approach that trains an agent to mimic the actions in the dataset
IQL: Implicit Q-Learning, an offline RL algorithm that avoids querying out-of-distribution actions by treating the value function update as an expectile regression
PPO: Proximal Policy Optimization, an on-policy RL algorithm used here to generate the datasets
SAC: Soft Actor-Critic, an off-policy RL algorithm used here to generate the 'warmstart' dataset
Compositional RL: RL approaches where tasks are decomposed into functional modules (e.g., 'pick', 'place', 'robot arm') that can be recombined to solve new tasks
Zero-shot Generalization: The ability of a model to solve a task it never saw during training, relying on knowledge transfer from related tasks
Warmstart: Data collected during the early stages of training (low success rate), simulating a scenario where limited online RL was performed
Medium-Replay: A dataset consisting of the replay buffer of an agent trained up to medium performance, containing a mix of poor and decent trajectories
CompoSuite: A simulated robotic manipulation benchmark consisting of 256 tasks created by composing robot arms, objects, obstacles, and objectives
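
The expectile regression mentioned in the IQL entry can be sketched as an asymmetric squared error. This is a minimal illustration under stated assumptions, not the implementation used in the summarized work: the function name `expectile_loss`, the plain-list inputs, and the default `tau=0.7` are all chosen here for illustration.

```python
def expectile_loss(q_values, v_values, tau=0.7):
    """Mean asymmetric squared error between Q estimates and V predictions.

    Residuals where Q exceeds V are weighted by tau, the rest by (1 - tau).
    The name, signature, and default tau are illustrative assumptions.
    """
    if len(q_values) != len(v_values):
        raise ValueError("q_values and v_values must have the same length")
    if not q_values:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for q, v in zip(q_values, v_values):
        diff = q - v
        # Under-estimates of Q (diff > 0) are penalized more when tau > 0.5.
        weight = tau if diff > 0 else 1.0 - tau
        total += weight * diff * diff
    return total / len(q_values)
```

With `tau = 0.5` the loss reduces to half the ordinary mean squared error. Raising `tau` above 0.5 pushes the value estimate toward the larger Q-values observed for actions in the dataset, which is how the update stays within in-distribution actions.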