_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
MARL: Multi-Agent Reinforcement Learning—learning where multiple agents interact in a shared environment
CTDE: Centralized Training with Decentralized Execution—training with access to global information (states, other agents' actions) while executing using only local observations
Dec-POMDP: Decentralized Partially Observable Markov Decision Process—a mathematical model for multi-agent coordination under uncertainty where agents share a reward but have local views
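A Dec-POMDP is usually written as the tuple ⟨I, S, {A_i}, T, R, {Ω_i}, O, γ⟩. A minimal sketch of that structure, with an illustrative two-agent "both must push the door" instance (all names and dynamics here are hypothetical, not from any specific benchmark):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Minimal container for the Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O, gamma>.
@dataclass
class DecPOMDP:
    agents: List[str]              # I: agent identifiers
    states: List[str]              # S: global states (not directly observed)
    actions: Dict[str, List[str]]  # {A_i}: per-agent action sets
    obs: Dict[str, List[str]]      # {Omega_i}: per-agent observation sets
    transition: Callable           # T(s, joint_a) -> next state (stochastic in general)
    reward: Callable               # R(s, joint_a) -> one shared team reward
    observe: Callable              # O(next_s) -> a local observation per agent
    gamma: float = 0.99            # discount factor

# Toy instance: the shared reward arrives only if both agents "push" together.
env = DecPOMDP(
    agents=["a1", "a2"],
    states=["door_closed", "door_open"],
    actions={"a1": ["push", "wait"], "a2": ["push", "wait"]},
    obs={"a1": ["near_door"], "a2": ["near_door"]},
    transition=lambda s, ja: "door_open" if ja == ("push", "push") else s,
    reward=lambda s, ja: 1.0 if ja == ("push", "push") else 0.0,
    observe=lambda s: {"a1": "near_door", "a2": "near_door"},
    gamma=0.95,
)
```

The key features the sketch captures: a single team reward (cooperation) and per-agent observations that hide the global state (partial observability).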
CTE: Centralized Training and Execution—a centralized controller makes decisions for all agents based on global info
DTE: Decentralized Training and Execution—agents learn and act independently without any centralized coordination phase
VDN: Value Decomposition Networks—a method where the joint Q-value is the sum of individual agent Q-values
QMIX: A method generalizing VDN by allowing the joint Q-value to be a non-linear (but monotonic) combination of individual Q-values
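The VDN and QMIX definitions above can be sketched numerically. In this hedged toy example (all Q-values and weights are made up for illustration), VDN sums per-agent utilities, while a QMIX-style mixer combines them with non-negative weights so that the joint value is monotonic in each agent's utility; monotonicity is what lets each agent pick its greedy local action and still maximize the joint Q:

```python
import numpy as np

# Hypothetical per-agent Q-values: 2 agents, 3 actions each (illustrative numbers).
q_agent = np.array([[1.0, 0.5, 0.2],   # agent 1's utilities
                    [0.3, 0.9, 0.1]])  # agent 2's utilities

# VDN: joint Q is the sum of each agent's chosen-action utility.
def vdn_joint_q(chosen):  # chosen = tuple of per-agent action indices
    return float(sum(q_agent[i, a] for i, a in enumerate(chosen)))

# QMIX-style: a monotonic mixing of utilities. Non-negative weights give
# dQ_joint/dQ_i >= 0 (in full QMIX the weights come from a hypernetwork
# conditioned on the global state; a fixed vector suffices to illustrate).
w = np.array([0.7, 1.3])
def qmix_joint_q(chosen):
    utils = np.array([q_agent[i, a] for i, a in enumerate(chosen)])
    return float(w @ utils)

# Decentralized greedy selection: each agent argmaxes its own utility.
greedy = tuple(int(np.argmax(q_agent[i])) for i in range(2))
```

Because both mixers are monotonic, the decentralized greedy joint action also maximizes the mixed joint Q, which is exactly the property that makes centralized training compatible with decentralized execution.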
MADDPG: Multi-Agent Deep Deterministic Policy Gradient—an actor-critic method whose centralized critic conditions on the global state and all agents' actions during training, guiding decentralized actors that act on local observations alone
COMA: Counterfactual Multi-Agent Policy Gradients—uses a centralized critic to estimate a counterfactual baseline (what if this agent acted differently?)
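COMA's counterfactual baseline can be shown with a small worked example (all numbers are illustrative, not from the paper): hold the other agents' actions fixed, average the centralized Q over agent i's alternative actions under its own policy, and subtract that baseline from the Q of the action actually taken:

```python
import numpy as np

# Hypothetical centralized Q-values for agent i's 3 candidate actions,
# with the other agents' actions held fixed (illustrative numbers).
q_counterfactual = np.array([2.0, 1.0, 0.5])
pi_i = np.array([0.6, 0.3, 0.1])  # agent i's current policy over its actions
taken = 0                         # the action agent i actually took

# Counterfactual baseline: policy-weighted average over agent i's alternatives.
baseline = float(pi_i @ q_counterfactual)            # 1.2 + 0.3 + 0.05 = 1.55
# COMA advantage: "how much better was the taken action than agent i's
# expected contribution?" — a per-agent credit assignment signal.
advantage = float(q_counterfactual[taken] - baseline)  # 2.0 - 1.55 = 0.45
```

A positive advantage here says agent i's chosen action contributed more than its policy average would have, isolating that agent's credit from the shared team reward.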
MAPPO: Multi-Agent Proximal Policy Optimization—PPO (Proximal Policy Optimization, an on-policy policy-gradient method that stabilizes updates with a clipped surrogate objective) applied to the multi-agent setting using a centralized value function
NEXP-complete: Complete for the complexity class NEXP (Nondeterministic Exponential Time); finite-horizon Dec-POMDP planning is NEXP-complete, so exact solutions are believed to require doubly exponential time in the worst case
MMDP: Multi-agent Markov Decision Process—a fully observable cooperative setting in which every agent sees the global state, simpler than a Dec-POMDP
Value Function Factorization: Decomposing the global team value function into per-agent utility functions so that each agent can select actions by greedily maximizing its own utility while the combination still maximizes the joint value (as in VDN and QMIX)