_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
MBRL: Model-Based Reinforcement Learning—learning a model of the environment's dynamics to simulate and optimize behavior.
DreamerV3: A state-of-the-art model-based RL algorithm that learns a latent world model to generate synthetic experience for policy training.
RSSM: Recurrent State-Space Model—a specific neural network architecture used in Dreamer to model dynamics using both deterministic and stochastic components.
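To make the deterministic/stochastic split concrete, here is a toy, purely illustrative sketch. All names and the update rule are invented for illustration; the actual RSSM uses a GRU for the deterministic path and learned Gaussian parameters for the stochastic latent.

```python
import math
import random

def rssm_step(h, z, action, w=0.9):
    """Toy RSSM-style state update (illustrative only): a deterministic
    recurrent path h plus a stochastic latent z sampled around it."""
    # Deterministic component: recurrent blend of previous state, latent, action
    h_next = [math.tanh(w * hi + zi + action) for hi, zi in zip(h, z)]
    # Stochastic component: sample around a mean predicted from h_next
    z_next = [hn + random.gauss(0.0, 0.1) for hn in h_next]
    return h_next, z_next
```

The key design point this mimics: the deterministic path carries information reliably across time, while the stochastic path lets the model represent uncertainty about the environment.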
Scaffolder: The proposed method that uses privileged sensors to improve world models, critics, and exploration for a target policy.
Privileged Information: Data available only during training (like ground-truth states or extra camera views) but not during deployment.
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly observe the full state of the world.
Transdecoder: A neural component in Scaffolder that maps privileged latent states to predicted target observations, enabling the target policy to run inside the scaffolded world model.
S3 Suite: Sensory Scaffolding Suite—a new benchmark of 10 simulated robotic tasks designed to evaluate agents with limited test-time sensors.
TD-lambda: Temporal Difference lambda, also written TD(λ); a method for estimating the value of a state by blending n-step returns at multiple horizons, with weights that decay by a factor of λ per additional step.
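As an illustrative sketch (not from the paper), the TD(λ) return can be computed with a backward recursion over a trajectory; here `values` holds bootstrapped state-value estimates, one entry longer than `rewards`:

```python
def td_lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion for the TD(lambda) return:
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns
```

Setting `lam=0` recovers one-step TD targets (low variance, more bias), while `lam=1` recovers full Monte Carlo returns (high variance, no bootstrap bias); intermediate values trade the two off.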
Critic: A neural network that estimates the value (expected discounted sum of future rewards) of a state or action.
World Model: A learned simulator that predicts future states and rewards given current states and actions.
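A world model enables "imagination": rolling a policy forward entirely inside the learned simulator, with no environment interaction. A minimal sketch of that loop, using an invented interface where the model maps (state, action) to (predicted next state, predicted reward):

```python
def imagine(world_model, policy, state, horizon):
    """Roll a policy forward inside a learned world model (sketch;
    the world_model/policy interface is invented for illustration)."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        # Predicted transition and reward -- no real environment is queried
        state, reward = world_model(state, action)
        trajectory.append((state, action, reward))
    return trajectory
```

Dreamer-style agents train the critic and policy on such imagined trajectories, which is far cheaper than collecting real experience.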
Latent State: A compressed internal representation of the environment state learned by the neural network.