
A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

Sheng-You Huang, Hsiao-Chuan Chang, Yen-Chi Chen, Ting-Han Wei, I-Hau Yeh, Sheng-Yao Kuan, Chien-Yao Wang, Hsuan-Han Lee, I-Chen Wu
arXiv (2026)
RL Agent

📝 Paper Summary

Traffic Signal Control (TSC) · Multi-Agent Reinforcement Learning (MARL)
A multi-agent reinforcement learning framework for traffic control that combines randomized training, exponential phase adjustments, and neighbor-based observations to improve robustness and scalability.
Core Problem
Existing RL traffic control methods overfit to static training patterns, lack safety-critical stability in action spaces, and struggle to scale coordination to large networks.
Why it matters:
  • Traffic congestion costs the U.S. economy over $85 billion annually (2025 data), with drivers losing 50–112 hours to delays
  • Standard RL agents memorize fixed timing schedules instead of learning dynamics, failing when real-world traffic flows fluctuate
  • Centralized approaches do not scale to large city grids due to exponential state-space growth, while local approaches fail to coordinate green waves
Concrete Example: In standard training, if traffic always arrives at a fixed rate, an agent implicitly memorizes 'switch after 15s' rather than reacting to queue lengths. When deployed in an environment where flow varies, this brittle policy fails to clear sudden platoons, causing gridlock.
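The contrast between a memorized schedule and a demand-reactive policy can be shown in a minimal sketch (function names, thresholds, and the minimum-green constraint are illustrative assumptions, not taken from the paper):

```python
def fixed_timer_policy(elapsed_s: float) -> bool:
    """Brittle memorized rule: switch after 15 s regardless of demand."""
    return elapsed_s >= 15

def queue_reactive_policy(queue_len: int, cross_queue_len: int,
                          min_green_s: float, elapsed_s: float) -> bool:
    """Switch only when cross traffic dominates the current direction
    and a minimum green time has elapsed (a common safety constraint)."""
    return elapsed_s >= min_green_s and cross_queue_len > queue_len
```

Under a sudden platoon on the cross street, the timer policy keeps its memorized schedule while the reactive policy switches as soon as the minimum green allows.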
Key Novelty
Robust CTDE with Exponential Control
  • Turning Ratio Randomization: Perturbs traffic turning probabilities during training to prevent agents from overfitting to static flow patterns
  • Exponential Phase Duration Adjustment: A cyclic action space using exponential steps (e.g., ±1s, ±2s, ±4s) to allow both fine-tuning for stability and large jumps for responsiveness
  • Neighbor-Based CTDE: Uses Centralized Training with Decentralized Execution where agents only observe immediate neighbors, balancing global coordination with scalable local communication
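The first two mechanisms can be sketched in a few lines (all names, step sizes, jitter magnitudes, and safety bounds below are illustrative assumptions, not values from the paper):

```python
import random

# Exponential phase-duration adjustments: small steps for stable
# fine-tuning, large jumps for responsiveness to sudden demand.
PHASE_ACTIONS = [-8, -4, -2, -1, 0, +1, +2, +4, +8]  # seconds

def apply_phase_action(duration_s: int, action_idx: int,
                       min_s: int = 5, max_s: int = 60) -> int:
    """Adjust the current green-phase duration by an exponential step,
    clamped to assumed safety bounds."""
    return max(min_s, min(max_s, duration_s + PHASE_ACTIONS[action_idx]))

def randomize_turning_ratios(base_ratios, jitter=0.1, rng=random):
    """Perturb per-movement turning probabilities at the start of each
    training episode so agents cannot memorize a static flow pattern."""
    noisy = [max(1e-3, r + rng.uniform(-jitter, jitter)) for r in base_ratios]
    total = sum(noisy)
    return [r / total for r in noisy]  # renormalize to a valid distribution
```

The clamp keeps every action inside a safe phase-duration range, so even the largest exponential jump cannot produce an unsafe timing plan.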
Evaluation Highlights
  • Reduces average waiting time by over 10% compared to standard RL baselines in unseen traffic scenarios
  • Demonstrates superior generalization to dynamic flow variations where baselines suffer from overfitting
  • Maintains high control stability through the proposed exponential adjustment mechanism, avoiding the oscillation issues of binary switching methods
Breakthrough Assessment
7/10
Solid engineering improvements for RL-TSC. The exponential action space and randomization are practical solutions to known overfitting/stability issues, though the fundamental algorithm (MAPPO) is standard.