_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
MARL: Multi-Agent Reinforcement Learning—learning where multiple agents interact in a shared environment
CTDE: Centralized Training with Decentralized Execution—training with access to global information (states, other agents' actions) while executing using only local observations
Dec-POMDP: Decentralized Partially Observable Markov Decision Process—a mathematical model for multi-agent coordination under uncertainty where agents share a reward but have local views
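A Dec-POMDP is usually written as the tuple ⟨I, S, {A_i}, T, R, {Ω_i}, O, γ⟩. A minimal sketch of that structure, with an illustrative two-agent "both must push the door" instance (all names and dynamics here are hypothetical, not from any specific benchmark):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Minimal container for the Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O, gamma>.
@dataclass
class DecPOMDP:
    agents: List[str]              # I: agent identifiers
    states: List[str]              # S: global states (not directly observed)
    actions: Dict[str, List[str]]  # {A_i}: per-agent action sets
    obs: Dict[str, List[str]]      # {Omega_i}: per-agent observation sets
    transition: Callable           # T(s, joint_a) -> next state (stochastic in general)
    reward: Callable               # R(s, joint_a) -> one shared team reward
    observe: Callable              # O(next_s) -> a local observation per agent
    gamma: float = 0.99            # discount factor

# Toy instance: the shared reward arrives only if both agents "push" together.
env = DecPOMDP(
    agents=["a1", "a2"],
    states=["door_closed", "door_open"],
    actions={"a1": ["push", "wait"], "a2": ["push", "wait"]},
    obs={"a1": ["near_door"], "a2": ["near_door"]},
    transition=lambda s, ja: "door_open" if ja == ("push", "push") else s,
    reward=lambda s, ja: 1.0 if ja == ("push", "push") else 0.0,
    observe=lambda s: {"a1": "near_door", "a2": "near_door"},
    gamma=0.95,
)
```

The key features the sketch captures: a single team reward (cooperation) and per-agent observations that hide the global state (partial observability).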
CTE: Centralized Training and Execution—a centralized controller makes decisions for all agents based on global info
DTE: Decentralized Training and Execution—agents learn and act independently without any centralized coordination phase
VDN: Value Decomposition Networks—a method where the joint Q-value is the sum of individual agent Q-values
QMIX: A method generalizing VDN by allowing the joint Q-value to be a non-linear (but monotonic) combination of individual Q-values
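The VDN and QMIX definitions above can be sketched numerically. In this hedged toy example (all Q-values and weights are made up for illustration), VDN sums per-agent utilities, while a QMIX-style mixer combines them with non-negative weights so that the joint value is monotonic in each agent's utility; monotonicity is what lets each agent pick its greedy local action and still maximize the joint Q:

```python
import numpy as np

# Hypothetical per-agent Q-values: 2 agents, 3 actions each (illustrative numbers).
q_agent = np.array([[1.0, 0.5, 0.2],   # agent 1's utilities
                    [0.3, 0.9, 0.1]])  # agent 2's utilities

# VDN: joint Q is the sum of each agent's chosen-action utility.
def vdn_joint_q(chosen):  # chosen = tuple of per-agent action indices
    return float(sum(q_agent[i, a] for i, a in enumerate(chosen)))

# QMIX-style: a monotonic mixing of utilities. Non-negative weights give
# dQ_joint/dQ_i >= 0 (in full QMIX the weights come from a hypernetwork
# conditioned on the global state; a fixed vector suffices to illustrate).
w = np.array([0.7, 1.3])
def qmix_joint_q(chosen):
    utils = np.array([q_agent[i, a] for i, a in enumerate(chosen)])
    return float(w @ utils)

# Decentralized greedy selection: each agent argmaxes its own utility.
greedy = tuple(int(np.argmax(q_agent[i])) for i in range(2))
```

Because both mixers are monotonic, the decentralized greedy joint action also maximizes the mixed joint Q, which is exactly the property that makes centralized training compatible with decentralized execution.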
MADDPG: Multi-Agent Deep Deterministic Policy Gradient—an actor-critic method whose centralized critic conditions on the global state and all agents' actions during training, guiding decentralized actors that act on local observations alone
COMA: Counterfactual Multi-Agent Policy Gradients—uses a centralized critic to estimate a counterfactual baseline (what if this agent acted differently?)
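COMA's counterfactual baseline can be shown with a small worked example (all numbers are illustrative, not from the paper): hold the other agents' actions fixed, average the centralized Q over agent i's alternative actions under its own policy, and subtract that baseline from the Q of the action actually taken:

```python
import numpy as np

# Hypothetical centralized Q-values for agent i's 3 candidate actions,
# with the other agents' actions held fixed (illustrative numbers).
q_counterfactual = np.array([2.0, 1.0, 0.5])
pi_i = np.array([0.6, 0.3, 0.1])  # agent i's current policy over its actions
taken = 0                         # the action agent i actually took

# Counterfactual baseline: policy-weighted average over agent i's alternatives.
baseline = float(pi_i @ q_counterfactual)            # 1.2 + 0.3 + 0.05 = 1.55
# COMA advantage: "how much better was the taken action than agent i's
# expected contribution?" — a per-agent credit assignment signal.
advantage = float(q_counterfactual[taken] - baseline)  # 2.0 - 1.55 = 0.45
```

A positive advantage here says agent i's chosen action contributed more than its policy average would have, isolating that agent's credit from the shared team reward.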
MAPPO: Multi-Agent Proximal Policy Optimization—PPO (Proximal Policy Optimization, an on-policy policy-gradient method that stabilizes updates with a clipped surrogate objective) applied to the multi-agent setting using a centralized value function
NEXP-complete: Complete for the complexity class NEXP (Nondeterministic Exponential Time); finite-horizon Dec-POMDP planning is NEXP-complete, so exact solutions are believed to require doubly exponential time in the worst case
MMDP: Multi-agent Markov Decision Process—a fully observable cooperative setting in which every agent sees the global state, simpler than a Dec-POMDP
Value Function Factorization: Decomposing the global team value function into per-agent utility functions so that each agent can select actions by greedily maximizing its own utility while the combination still maximizes the joint value (as in VDN and QMIX)