_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
SMART: Self-supervised Multi-task pretrAining with contRol Transformer—the proposed framework.
CT: Control Transformer—the specific transformer architecture used in SMART that processes observation-action sequences.
DMC: DeepMind Control Suite—a standard benchmark for continuous control physics tasks.
RTG: Return-to-Go—the sum of rewards from a given timestep to the end of the episode, often used as a conditioning signal for policies in offline RL.
BC: Behavior Cloning—a supervised Imitation Learning method in which the agent learns to mimic the expert's actions given observations.
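A minimal sketch (not from the paper) of computing RTGs for an undiscounted episode: each timestep's RTG is the suffix sum of the reward sequence.

```python
def returns_to_go(rewards):
    """Return-to-Go: rtg[t] = sum of rewards from timestep t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]   # accumulate future rewards backwards in time
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] -> RTGs [3.0, 2.0, 2.0]
```

In RTG-conditioned policies such as Decision Transformer, these values are fed to the model alongside observations and actions.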
IL: Imitation Learning—learning a policy from demonstrations.
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly observe the full state.
Causal Attention: An attention mechanism where a token can attend only to itself and previous tokens in the sequence (preserving time order).
Inverse Dynamics: Predicting the action that caused the transition between two consecutive states/observations.
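A minimal NumPy sketch (an illustration, not the paper's implementation) of how causal attention is enforced: a lower-triangular mask sets future positions to -inf before the softmax, so each token's attention weights over future tokens are exactly zero.

```python
import numpy as np

def causal_mask(T):
    # True where attention is allowed: position i may attend to j <= i.
    return np.tril(np.ones((T, T), dtype=bool))

def causal_attention_weights(scores):
    """scores: (T, T) raw attention logits. Disallowed (future) positions
    are set to -inf, so they receive zero weight after the softmax."""
    T = scores.shape[0]
    masked = np.where(causal_mask(T), scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With uniform logits, token 0 attends only to itself, token 1 splits weight evenly over tokens 0 and 1, and so on.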
ACL: Action Contrastive Learning—a baseline method using a modified BERT (Bidirectional Encoder Representations from Transformers) with a contrastive loss.
DT: Decision Transformer—a baseline that models RL as a sequence modeling problem, typically conditioning on returns.
Momentum Encoder: A copy of a network updated via an exponential moving average of the online network's weights, used to provide stable targets during training (as in MoCo, Momentum Contrast).
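A minimal sketch of the exponential-moving-average update behind a momentum encoder (parameters shown as plain floats for illustration; real implementations update tensors in place):

```python
def momentum_update(online_params, target_params, m=0.99):
    """EMA update: target <- m * target + (1 - m) * online.
    With m close to 1, the target (momentum) encoder changes slowly,
    providing stable representations to contrast against."""
    return [m * t + (1.0 - m) * o
            for o, t in zip(online_params, target_params)]
```

The momentum coefficient m controls how slowly the target tracks the online network; MoCo-style setups typically use values like 0.99 or 0.999.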