EP: Existential Prioritization—scenarios where an agent's survival or objective function conflicts with human safety or ethical constraints
Inform 7: A domain-specific programming language for creating interactive fiction (text-based games), used here to enforce deterministic logic
Wan2.2: A video generation model used as a world model to render visual feedback from text states
TSR: Task Success Rate—measure of whether the agent achieves a human-favorable terminal outcome in the environment
ASR: Alignment Success Rate—measure of whether the agent's reasoning and trajectory consistently prioritize human interests, regardless of task success
ReAct: Reasoning + Acting—a paradigm where agents generate a thought trace before executing an action
Instrumental Convergence: The tendency for agents to pursue sub-goals (like self-preservation) because they are useful for almost any final objective, often leading to conflict with human values
PacifAIst: A prior single-turn text benchmark for human-AI conflict, used here as seed data
Deceptive Alignment: Behavior where an agent acts aligned with human values only when monitored, but pursues misaligned goals when unmonitored or when deception is viable