Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning

📝 Paper Summary

Multi-Agent Reinforcement Learning (MARL) · Hierarchical Reinforcement Learning · Air Combat Simulation

A hierarchical multi-agent framework adapts Simple Policy Optimization (SPO) to train heterogeneous aircraft teams in realistic dogfights, using low-level maneuver policies guided by high-level tactical commanders.

Core Problem

Realistic air combat involves high-dimensional, nonlinear flight dynamics and partial observability, making it difficult for standard 'flat' RL policies to simultaneously master precise control and high-level strategy.

Why it matters:

End-to-end RL often fails to converge in complex physics environments like JSBSim due to the difficulty of learning control and tactics simultaneously
Existing air combat simulations often simplify physics (3-DOF) or use homogeneous agents, limiting applicability to real-world defense scenarios involving diverse aircraft types
Standard algorithms like PPO (Proximal Policy Optimization) may be less sample-efficient or stable than newer approaches like SPO in these high-stakes control domains

Concrete Example: In a 10-vs-10 engagement, a non-hierarchical (flat) agent attempts to map raw observations directly to throttle/stick inputs. It fails to learn effective strategies, achieving a 0% win rate, whereas the hierarchical approach decomposes the task into tactical decisions (e.g., 'Engage') and execution.

Key Novelty

Hierarchical Heterogeneous Multi-Agent RL (HHMARL) with MA-SPO

Decomposes decision-making into two levels: High-level commanders issue discrete tactical orders (Attack, Engage, Defend), while low-level policies execute continuous flight maneuvers
Adapts Simple Policy Optimization (SPO) to the multi-agent domain (MA-SPO) using Centralized Training and Decentralized Execution (CTDE)
Integrates a curriculum learning pipeline with league-play to progressively train heterogeneous agents (F16 and A4 aircraft) from basic maneuvers to complex team coordination

Architecture

The hierarchical decision-making process for F16 and A4 agents.

Evaluation Highlights

Achieved 90% win rate in 3-vs-3 scenarios using MA-SPO, outperforming MA-PPO (88%) and completely dominating non-hierarchical baselines (0%)
Maintained >80% win rate in large-scale 10-vs-10 battles (83% for MA-SPO), demonstrating scalability where flat policies failed entirely
MA-SPO low-level policies achieved higher mean rewards and faster convergence than PPO and SAC (Soft Actor-Critic) in 1-vs-1 dogfight training

Breakthrough Assessment

7/10

Strong engineering application combining realistic physics (JSBSim) with a novel algorithm adaptation (MA-SPO). While the hierarchy concept is established, the successful integration with heterogeneous agents and league-play in a high-fidelity sim is significant.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Semi-Markov Game (POSMG) for the high-level hierarchy and Partially Observable Markov Game (POMG) for low-level control

Inputs: Aircraft state (position, velocity, orientation angles), relative distance to opponents, and weapon engagement zones

Outputs: High-level: Discrete option selection (0=Defend, 1=Engage, 2=Attack); Low-level: Continuous control inputs (aileron, elevator, rudder, throttle, shoot)

Pipeline Flow

Input Processing (Observations)
High-Level Commander (Selects Option)
Low-Level Controller (Executes Maneuvers)
Physics Integration (JSBSim)
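
The four pipeline stages can be pictured as a two-rate control loop. The sketch below is a toy illustration only: `commander`, `low_level_controller`, and `simulate_step` are hypothetical placeholders for the learned networks and the JSBSim step, the 1-D "distance" state is invented for the example, and the fixed `option_duration` is an assumption (in a semi-Markov setting the option length may vary).

```python
from enum import IntEnum


class Option(IntEnum):
    # Option indices as listed under "Outputs" in this summary.
    DEFEND = 0
    ENGAGE = 1
    ATTACK = 2


def commander(observation):
    """Placeholder high-level policy: maps an observation to an option."""
    # Hypothetical rule standing in for the learned commander network.
    return Option.ATTACK if observation["distance_km"] < 2.0 else Option.ENGAGE


def low_level_controller(option, observation):
    """Placeholder low-level policy: returns continuous control inputs."""
    # A learned per-option network would go here; values are illustrative.
    throttle = {Option.DEFEND: 0.6, Option.ENGAGE: 0.8, Option.ATTACK: 1.0}[option]
    return {"aileron": 0.0, "elevator": 0.0, "rudder": 0.0,
            "throttle": throttle, "shoot": float(option == Option.ATTACK)}


def simulate_step(observation, controls):
    """Stand-in for the physics step: higher throttle closes distance faster."""
    closing_km = 0.05 * controls["throttle"]
    return {"distance_km": max(0.0, observation["distance_km"] - closing_km)}


def run_episode(initial_distance_km=5.0, total_steps=100, option_duration=10):
    """Commander picks an option every `option_duration` low-level steps."""
    observation = {"distance_km": initial_distance_km}
    option, chosen_options = None, []
    for t in range(total_steps):
        if t % option_duration == 0:  # high-level decision point
            option = commander(observation)
            chosen_options.append(option)
        controls = low_level_controller(option, observation)  # every step
        observation = simulate_step(observation, controls)
    return chosen_options, observation


chosen_options, final_observation = run_episode()
```

The point of the structure is that the commander acts on a slower clock than the controller, so the high-level policy only has to learn which maneuver to request, not how to fly it.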

System Modules

High-Level Commander

Strategic decision maker

Model or implementation: Actor-Critic Network (200-unit layers + Attention)

Low-Level Controller

Maneuver execution

Model or implementation: Actor-Critic Network (Separate weights per option type)

JSBSim Environment

Physics Engine

Model or implementation: F16 and A4 aerodynamic models

Novel Architectural Elements

Heterogeneous Hierarchical Structure: Shared commander policies per aircraft type (F16 vs A4) controlling type-specific low-level maneuver policies
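
One way to read this sharing scheme is as a mapping from agents to policy identifiers keyed by aircraft type. The following is a minimal sketch under that reading; the `Agent` class and the identifier naming (`commander_F16`, `engage_A4`, ...) are hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    agent_id: int
    aircraft_type: str  # "F16" or "A4"


def commander_policy_id(agent):
    """All agents of one aircraft type share a single commander policy."""
    return f"commander_{agent.aircraft_type}"


def low_level_policy_id(agent, option_name):
    """Low-level policies are specific to both the option and the aircraft type."""
    return f"{option_name}_{agent.aircraft_type}"


team = [Agent(0, "F16"), Agent(1, "A4"), Agent(2, "F16")]
commander_ids = {a.agent_id: commander_policy_id(a) for a in team}
distinct_commanders = sorted(set(commander_ids.values()))
```

With this mapping a team of any size needs only two commander policies, which is one plausible reason the same policies can be reused from 3-vs-3 up to 10-vs-10.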

Modeling

Base Model: Custom Actor-Critic Networks (Linear layers + Attention module)

Training Method: Multi-Agent Simple Policy Optimization (MA-SPO)

Objective Functions:

Purpose: Optimize policy within trust region using all samples.

Formally: L_s = -E_t[ r_t(θ) A_t - (|A_t|^2 / (2ε)) (r_t(θ) - 1)^2 ]

Purpose: Minimize value estimation error.

Formally: L_v = E_t[ (V_φ(s_t) - R_t)^2 ]

Purpose: Encourage exploration via entropy.

Formally: L_e = -H(π_θ)
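
To make the three terms concrete, here is a small pure-Python sketch that evaluates them on lists of numbers. It follows the expressions as written in this summary, where r_t(θ) is the usual probability ratio between the new and old policy. Several things are assumptions for illustration: the `advantage_power` switch, the discrete-distribution form of the entropy, and combining the terms as a weighted sum using the value and entropy coefficients listed under Key Hyperparameters.

```python
import math


def spo_policy_loss(ratios, advantages, epsilon=0.25, advantage_power=2):
    """Policy loss L_s following the expression in this summary.

    advantage_power=2 mirrors the |A_t|^2 term as written; setting it to 1
    penalizes with |A_t| instead (a switch added here for illustration).
    """
    terms = []
    for r, a in zip(ratios, advantages):
        penalty = (abs(a) ** advantage_power) / (2.0 * epsilon) * (r - 1.0) ** 2
        terms.append(r * a - penalty)
    return -sum(terms) / len(terms)


def value_loss(values, returns):
    """L_v: mean squared error between value predictions and returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


def entropy_loss(action_probs):
    """L_e = -H(pi), averaged over a batch of discrete action distributions."""
    entropies = [-sum(p * math.log(p) for p in probs if p > 0.0)
                 for probs in action_probs]
    return -sum(entropies) / len(entropies)


def total_loss(ratios, advantages, values, returns, action_probs,
               value_coef=0.9, entropy_coef=0.05):
    """Weighted sum of the three terms (the weighting scheme is an assumption)."""
    return (spo_policy_loss(ratios, advantages)
            + value_coef * value_loss(values, returns)
            + entropy_coef * entropy_loss(action_probs))
```

Unlike a clipped objective, the quadratic penalty keeps a nonzero gradient for every sample, which is what "using all samples" refers to: a ratio far from 1 is pulled back rather than ignored.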

Training Data:

Generated via self-play and league-play simulation in JSBSim
Curriculum Levels: L1 (Random targets), L2 (Active targets), L3 (Self-play), L4 (League-play)
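
The four curriculum levels suggest a simple opponent-selection rule per level. The sketch below is a hedged illustration, not the authors' pipeline: the promotion threshold `promote_at`, the level labels, and the uniform sampling from the league are all assumptions.

```python
import random

CURRICULUM = ["L1_random_targets", "L2_active_targets",
              "L3_self_play", "L4_league_play"]


def next_level(level_index, recent_win_rate, promote_at=0.8):
    """Advance one level once the recent win rate reaches a (hypothetical) threshold."""
    if recent_win_rate >= promote_at and level_index < len(CURRICULUM) - 1:
        return level_index + 1
    return level_index


def sample_opponent(level_index, current_policy, league, rng):
    """Pick the opponent for the next episode according to the curriculum level."""
    level = CURRICULUM[level_index]
    if level == "L3_self_play":
        return current_policy          # play against the current policy itself
    if level == "L4_league_play":
        return rng.choice(league)      # play against a stored past or diverse policy
    return level                       # L1/L2: a scripted target, named by level


rng = random.Random(0)
league = ["policy_v1", "policy_v2", "policy_v3"]
opponent = sample_opponent(3, "policy_current", league, rng)
```

Sampling from a pool at the last level is what distinguishes league-play from plain self-play: the learner cannot specialize against a single opponent.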

Key Hyperparameters:

learning_rate: 1e-4 decaying to 1e-5
discount_factor_gamma: 0.995
spo_ratio_epsilon: 0.25
value_coefficient: 0.9
entropy_coefficient: 0.05 decaying to 0
batch_size: 6000 (low-level), 3000 (high-level)
mini_batch_size: 256
simulation_frequency: 100Hz
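
For reference, the listed values can be collected into a single configuration object. This is a sketch only: the field names are hypothetical, and the summary does not say how the learning rate and entropy coefficient decay, so the linear schedule below is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    lr_start: float = 1e-4
    lr_end: float = 1e-5
    gamma: float = 0.995
    spo_epsilon: float = 0.25
    value_coef: float = 0.9
    entropy_start: float = 0.05
    entropy_end: float = 0.0
    batch_size_low_level: int = 6000
    batch_size_high_level: int = 3000
    mini_batch_size: int = 256
    simulation_frequency_hz: int = 100


def linear_schedule(start, end, progress):
    """Interpolate from start to end as progress goes 0 -> 1 (assumed decay shape)."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


def minibatches_per_pass(batch_size, mini_batch_size):
    """Number of mini-batches needed to cover one batch (ceiling division)."""
    return -(-batch_size // mini_batch_size)


def effective_horizon(gamma):
    """Rough planning horizon in decision steps implied by the discount factor."""
    return 1.0 / (1.0 - gamma)


cfg = TrainConfig()
lr_midway = linear_schedule(cfg.lr_start, cfg.lr_end, 0.5)
low_level_minibatches = minibatches_per_pass(cfg.batch_size_low_level, cfg.mini_batch_size)
high_level_minibatches = minibatches_per_pass(cfg.batch_size_high_level, cfg.mini_batch_size)
```

A discount of 0.995 corresponds to an effective horizon of about 200 decision steps, which fits a task where the payoff of a maneuver arrives long after it starts.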

Compute: 32-core 3.6 GHz CPU; Training time ~10-30 hours per policy level

Comparison to Prior Work

vs. MA-PPO: Replaces PPO's ratio clipping with the SPO loss, which applies a quadratic, KL-like penalty to the probability ratio, for potentially better stability
vs. Lockheed Martin: Introduces heterogeneous agents (F16/A4) and integrates SPO instead of standard algorithms
vs. Flat PPO/SPO: Uses hierarchical temporal abstraction to solve realistic physics tasks that flat policies fail to learn
vs. LightZero [not cited in paper]: Comparison to MCTS-based planning methods is mentioned as future work

Limitations

Fixed low-level controllers may restrict adaptability in highly dynamic or novel scenarios not covered by the primitive options
Focus restricted to close-range cannon dogfights (Within Visual Range); excludes long-range missile dynamics
Training relies purely on CPU (32-core); no GPU acceleration details reported for the RL training

Reproducibility

Code: https://github.com/IDSIA-papers/HHMARL_AirCombat

The code is publicly available, and the environment relies on the open-source JSBSim simulator. High-level inference results in Table I are averaged over 1000 episodes.

📊 Experiments & Results

Evaluation Setup

3D air combat simulation using JSBSim physics with heterogeneous teams (F16 and A4)

Benchmarks:

Custom JSBSim Environment (10v10) (Large-scale team deathmatch) [New]
Custom JSBSim Environment (5v5, 3v3) (Small/Medium-scale team deathmatch) [New]

Metrics:

Win Rate (%)
Loss Rate (%)
Draw Rate (%)
Statistical methodology: Results averaged over 1000 episodes. No confidence intervals reported.
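
Since no confidence intervals are reported, a reader can approximate them from the episode count. The sketch below uses the normal approximation to a binomial proportion and assumes the 1000 episodes are independent and apply to each configuration; both assumptions go beyond what the summary states.

```python
import math


def outcome_rates(outcomes):
    """Fraction of episodes ending in a win, loss, or draw."""
    n = len(outcomes)
    return {kind: outcomes.count(kind) / n for kind in ("win", "loss", "draw")}


def normal_ci(rate, n_episodes, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    standard_error = math.sqrt(rate * (1.0 - rate) / n_episodes)
    return rate - z * standard_error, rate + z * standard_error


# 3-vs-3 win rates quoted in this summary, with the assumed 1000 episodes each.
spo_low, spo_high = normal_ci(0.90, 1000)
ppo_low, ppo_high = normal_ci(0.88, 1000)
intervals_overlap = spo_low <= ppo_high
```

Under these assumptions each interval is roughly ±2 percentage points wide, so the 3-vs-3 intervals for 90% and 88% overlap. Overlap of individual intervals is only a rough check, not a formal significance test.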

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
3-vs-3 Combat	Win Rate	88% (MA-PPO)	90%	+2 pp
10-vs-10 Combat	Win Rate	80%	83%	+3 pp
10-vs-10 Combat	Win Rate	0% (flat policy)	83%	+83 pp
1-vs-1 Dogfight (F16)	Win Rate against L3 opponent	15%	64%	+49 pp (Win vs Loss margin)

Experiment Figures

Training curves (Mean Reward vs Training Samples) for low-level policies (Engage, Attack, Defend) using SPO, PPO, and SAC.

Training curves for the High-Level Commander policy comparing MA-SPO, MA-PPO, and FC-SPO.

Main Takeaways

Hierarchical decomposition proved necessary in these experiments: non-hierarchical (flat) policies failed to learn any effective strategy (0% win rate)
MA-SPO consistently outperforms MA-PPO across all team sizes (3v3 to 10v10), with the gap widening slightly as complexity increases
Low-level maneuver policies trained with SPO achieved higher rewards than those trained with PPO or SAC
Strategic behavior emerged: MA-SPO agents favored 'Engage' (72%) and 'Defend' (7%) options in large battles, acting more cautiously than MA-PPO agents

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients, Actor-Critic)
Multi-Agent Systems (Centralized Training Decentralized Execution)
Flight Dynamics (6-DOF simulation concepts)

Key Terms

JSBSim: an open-source, non-linear flight dynamics model that simulates realistic aircraft physics (6 degrees of freedom)

SPO: Simple Policy Optimization, a recent policy gradient algorithm that uses a probability ratio in the loss but constrains updates with a quadratic, KL-like penalty on that ratio rather than PPO's clipping

MA-SPO: Multi-Agent Simple Policy Optimization—the authors' adaptation of SPO to multi-agent settings using centralized critics

CTDE: Centralized Training Decentralized Execution—a paradigm where agents train with access to global info (critic) but act using only local info (actor)
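
A toy illustration of the information split in CTDE, using linear functions in place of networks. The agent names, observation sizes, and weights are hypothetical; only the pattern (local input for the actor, joint input for the critic) reflects the definition above.

```python
def actor_score(local_obs, actor_weights):
    """Decentralized actor: uses only this agent's own observation."""
    return sum(w * x for w, x in zip(actor_weights, local_obs))


def centralized_value(all_obs, critic_weights):
    """Centralized critic: sees the concatenated observations of all agents."""
    joint_obs = [x for obs in all_obs for x in obs]
    return sum(w * x for w, x in zip(critic_weights, joint_obs))


# Two hypothetical agents with 2-dimensional local observations.
observations = {"f16_0": [0.2, 0.5], "a4_0": [0.9, 0.1]}
actor_weights = [1.0, -1.0]
critic_weights = [0.5, 0.5, 0.5, 0.5]

score_before = actor_score(observations["f16_0"], actor_weights)
value_before = centralized_value(list(observations.values()), critic_weights)

# Changing the teammate's observation affects the critic but not this actor.
observations["a4_0"] = [0.0, 0.0]
score_after = actor_score(observations["f16_0"], actor_weights)
value_after = centralized_value(list(observations.values()), critic_weights)
```

Because the critic is only needed to compute training targets, it can be discarded at execution time, leaving each aircraft to act on its own observation.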

HMARL: Hierarchical Multi-Agent Reinforcement Learning—structuring agents into layers, typically a manager (high-level) and workers (low-level)

League-Play: A training mechanism where agents play against a mixed population of past versions or diverse strategies to prevent overfitting to a single opponent

POSMG: Partially Observable Semi-Markov Game—a game theoretic model where actions (options) can last for variable amounts of time

PPO: Proximal Policy Optimization—a standard RL algorithm that prevents large policy updates via clipping

SAC: Soft Actor-Critic—an off-policy RL algorithm that maximizes entropy alongside expected return

WEZ: Weapon Engagement Zone—the geometric area relative to an aircraft where its weapons can effectively hit a target
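
As a geometric illustration of a WEZ, the sketch below tests whether a target lies inside a 2-D cone in front of the shooter. The cone shape, the `max_range_m` and `half_angle_deg` defaults, and the heading convention (degrees counterclockwise from the +x axis) are all assumptions chosen for the example, not values from the paper.

```python
import math


def in_wez(shooter_xy, shooter_heading_deg, target_xy,
           max_range_m=1000.0, half_angle_deg=10.0):
    """Return True if the target is within range and inside the firing cone."""
    dx = target_xy[0] - shooter_xy[0]
    dy = target_xy[1] - shooter_xy[1]
    if math.hypot(dx, dy) > max_range_m:
        return False
    bearing_deg = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180) before comparing.
    off_boresight = (bearing_deg - shooter_heading_deg + 180.0) % 360.0 - 180.0
    return abs(off_boresight) <= half_angle_deg


dead_ahead = in_wez((0.0, 0.0), 0.0, (500.0, 0.0))
off_to_the_side = in_wez((0.0, 0.0), 0.0, (500.0, 500.0))
too_far = in_wez((0.0, 0.0), 0.0, (2000.0, 0.0))
```

A check like this can feed both the observation (is the opponent inside my WEZ, am I inside theirs) and the reward for a cannon-only engagement.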