Bootstrapping Reinforcement Learning with Imitation for Vision-Based Agile Flight

📝 Paper Summary

Vision-based Quadrotor Control, Agile Drone Racing

A three-stage framework that trains a teacher policy with privileged states, distills it into a vision-based student, and fine-tunes the student with adaptive RL to achieve agile flight using only RGB images.

Core Problem

Training agile vision-based drone policies from scratch with RL is sample-inefficient and computationally demanding due to high-dimensional inputs, while Imitation Learning (IL) is limited by the expert's performance and struggles with covariate shift.

Why it matters:

Current autonomous racing drones often rely on external state estimation or IMUs, whereas human pilots fly using only visual cues
Pure RL from pixels fails to learn effectively within reasonable sample budgets due to the difficulty of exploration in high-dimensional spaces
Standard Imitation Learning cannot surpass the expert demonstrator and often fails when the drone drifts from the training distribution

Concrete Example: In the 'SplitS' maneuver, a drone may face frames without visible gate corners. A student policy trained via IL often fails here because it lacks the context to infer actions from partial information, crashing where an RL-refined policy would succeed.

Key Novelty

Teacher-Student Distillation with Adaptive RL Fine-tuning

Train a privileged expert using state-based RL, then distill it into a vision-based student policy via DAgger (Imitation Learning)
Use the pre-trained vision student as the initialization for a second round of RL, using a performance-adaptive update rule to prevent catastrophic forgetting during fine-tuning
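The distillation step can be illustrated with a toy DAgger loop. This is a minimal sketch under loud assumptions, not the paper's implementation: the 1D dynamics, the linear teacher with gain `TEACHER_GAIN`, and the helper names `rollout`, `fit_gain`, and `dagger` are all hypothetical. It only shows the pattern of letting the student drive, labeling the visited states with the teacher, and retraining on the aggregated dataset.

```python
TEACHER_GAIN = -0.8  # hypothetical privileged teacher: a = TEACHER_GAIN * x


def teacher_action(x):
    """Label a visited state with the teacher's action."""
    return TEACHER_GAIN * x


def rollout(student_gain, x0, steps):
    """Let the student drive toy dynamics x_{t+1} = x_t + a_t; return visited states."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + student_gain * x
    return states


def fit_gain(dataset):
    """Least-squares fit of a linear student a = gain * x on (state, action) pairs."""
    numerator = sum(x * a for x, a in dataset)
    denominator = sum(x * x for x, _ in dataset)
    return numerator / denominator if denominator > 0 else 0.0


def dagger(iterations=3, x0=1.0, steps=5):
    dataset, student_gain = [], 0.0
    for _ in range(iterations):
        visited = rollout(student_gain, x0, steps)                # student collects data
        dataset.extend((x, teacher_action(x)) for x in visited)   # teacher relabels it
        student_gain = fit_gain(dataset)                          # retrain on the aggregate
    return student_gain


print(dagger())
```

In this toy setting the student recovers the teacher's gain almost immediately because both are linear; the point is the data flow, in which training states come from the student's own rollouts rather than from teacher trajectories.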

Architecture

The three-phase training framework: (1) Teacher Training with RL on states, (2) Student Distillation via Imitation Learning on vision, (3) Student Fine-tuning via RL with Asymmetric Critic.

Evaluation Highlights

Achieved 100% Success Rate on the 'SplitS' track using gate corner inputs, whereas RL from scratch failed completely (0%)
Outperformed standard DAgger (Imitation Learning) by reducing lap times from 6.89s to 6.27s on the SplitS track
Demonstrated real-world transfer where RL fine-tuning improved success rate from 40% (DAgger) to 100% and reduced gate passing error by ~50%

Breakthrough Assessment

8/10

Significantly advances vision-based agile flight by successfully bridging the gap between sample-efficient IL and high-performance RL, achieving results where baselines fail completely.

⚙️ Technical Details

Problem Definition

Setting: Minimizing time to navigate a sequence of gates using only egocentric visual history

Inputs: Sequence of visual observations (RGB images or gate corners) o_{t-H+1:t}

Outputs: Control commands: Collective Thrust (c) and Bodyrates (ω_x, ω_y, ω_z)
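The input window o_{t-H+1:t} can be kept in a fixed-length buffer. The sketch below is illustrative only: the class name `ObservationHistory` and the choice to pad the start of an episode by repeating the first observation are assumptions, since the summary does not say how early timesteps are handled.

```python
from collections import deque


class ObservationHistory:
    """Fixed-length buffer holding the last H observations o_{t-H+1:t}."""

    def __init__(self, history_length=32):
        self.history_length = history_length
        self._buffer = deque(maxlen=history_length)

    def reset(self, first_observation):
        # Assumption: fill the window with copies of the first observation.
        self._buffer.clear()
        for _ in range(self.history_length):
            self._buffer.append(first_observation)

    def push(self, observation):
        # deque(maxlen=H) discards the oldest entry automatically.
        self._buffer.append(observation)

    def window(self):
        # Oldest observation first, newest last.
        return list(self._buffer)


history = ObservationHistory(history_length=4)
history.reset("o0")
history.push("o1")
history.push("o2")
print(history.window())
```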

Pipeline Flow

Visual Encoder (ResNet/TCN)
Policy Network (MLP)
Control Output

System Modules

Visual Encoder

Encodes visual history into a feature vector

Model or implementation: TCN (Temporal Convolutional Network) processing history of length H

Policy Network

Maps visual features to control actions

Model or implementation: 2-layer MLP

Performance-Adaptive Tuner

Adjusts learning rates and clip ranges dynamically based on rollout performance during Phase III

Model or implementation: Algorithm 1 (Adaptive update logic)

Novel Architectural Elements

Performance-Adaptive Online Fine-Tuning loop: dynamically scales learning rate and PPO clip range based on the policy's reward improvement to prevent catastrophic forgetting
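One way such a performance-adaptive rule could look is sketched here. This is not the paper's Algorithm 1, whose exact logic is not reproduced in this summary: the class `AdaptiveTuner`, the shrink and grow factors, and the 5% drop tolerance are hypothetical values chosen only to show the idea of tightening the learning rate and PPO clip range when rollout reward regresses and relaxing them when it recovers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveTuner:
    # All defaults below are illustrative assumptions, not values from the paper.
    learning_rate: float = 3e-4
    clip_range: float = 0.2
    shrink: float = 0.5           # factor applied when reward regresses
    grow: float = 1.1             # factor applied when reward holds or improves
    drop_tolerance: float = 0.05  # allowed relative drop below the best reward
    max_learning_rate: float = 3e-4
    max_clip_range: float = 0.2
    best_reward: Optional[float] = None

    def update(self, mean_rollout_reward):
        """Return (learning_rate, clip_range) to use for the next PPO update."""
        if self.best_reward is None:
            self.best_reward = mean_rollout_reward
            return self.learning_rate, self.clip_range

        threshold = self.best_reward - self.drop_tolerance * abs(self.best_reward)
        if mean_rollout_reward < threshold:
            # Performance regressed: take smaller, more conservative steps.
            self.learning_rate *= self.shrink
            self.clip_range *= self.shrink
        else:
            # Performance held or improved: cautiously relax toward the caps.
            self.learning_rate = min(self.learning_rate * self.grow, self.max_learning_rate)
            self.clip_range = min(self.clip_range * self.grow, self.max_clip_range)
            self.best_reward = max(self.best_reward, mean_rollout_reward)
        return self.learning_rate, self.clip_range


tuner = AdaptiveTuner()
for reward in (100.0, 80.0, 101.0):
    print(reward, tuner.update(reward))
```

The thresholds in such a rule are exactly the extra hyperparameters that an adaptive heuristic introduces.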

Modeling

Base Model: Custom TCN + MLP architecture (Student), ResNet50 (Visual Feature Extractor)

Training Method: Phase I: PPO (Teacher), Phase II: DAgger (Student Distillation), Phase III: Adaptive PPO Fine-tuning
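Phases I and III both rest on PPO's clipped surrogate objective. The per-sample form can be sketched as follows; the function name `clipped_surrogate` and the default `clip_range` of 0.2 are illustrative choices, not values reported for this paper.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)  # r = pi_new(a|o) / pi_old(a|o)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, the gain from raising the action probability
# is capped once the ratio exceeds 1 + clip_range.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

A smaller `clip_range` keeps each update closer to the previous policy, which is why shrinking it is a natural lever for protecting a pre-trained student during fine-tuning.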

Objective Functions:

Purpose: Train teacher with dense rewards.
Formally: PPO objective maximizing the sum of rewards (progress, perception, actuation penalty, collision)

Purpose: Distill teacher knowledge to student.
Formally: MSE loss L_A = ||π_teacher(s_t) - π_student(o_t)||^2

Purpose: Fine-tune student with RL.
Formally: PPO objective with Asymmetric Critic (privileged states for value function)
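The distillation loss L_A can be checked by hand on a single action pair. In this sketch the action layout (collective thrust followed by three bodyrates) follows the Outputs description, while the numeric values and the helper names `squared_action_error` and `batch_loss` are made up for illustration. Averaging over a batch is a common convention and is assumed here.

```python
def squared_action_error(teacher_action, student_action):
    """Squared L2 distance ||a_teacher - a_student||^2 for one sample."""
    if len(teacher_action) != len(student_action):
        raise ValueError("actions must have the same dimension")
    return sum((t - s) ** 2 for t, s in zip(teacher_action, student_action))


def batch_loss(teacher_actions, student_actions):
    """Mean of the per-sample squared errors over a batch (assumed convention)."""
    errors = [squared_action_error(t, s) for t, s in zip(teacher_actions, student_actions)]
    return sum(errors) / len(errors)


# Hypothetical actions: [collective thrust, omega_x, omega_y, omega_z]
teacher = [1.0, 0.0, 0.0, 0.5]
student = [0.5, 0.0, 0.0, 0.5]
print(squared_action_error(teacher, student))  # 0.25
```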

Training Data:

Simulated race tracks (SplitS, Figure 8, Kidney)
Domain randomization: gate scales, pixel noise, 10% corner dropout
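The corner dropout can be sketched as an independent per-corner mask. Whether the paper drops corners independently, per gate, or per frame is not stated in this summary, so the independent draw, the function name `drop_corners`, and the use of `None` as the missing-corner marker are all assumptions.

```python
import random


def drop_corners(corners, dropout_prob=0.1, rng=None, missing_value=None):
    """Replace each corner with `missing_value` with probability `dropout_prob`."""
    if rng is None:
        rng = random.Random()
    return [missing_value if rng.random() < dropout_prob else corner for corner in corners]


# Hypothetical pixel coordinates of one gate's four corners.
gate_corners = [(120.0, 80.0), (200.0, 80.0), (200.0, 160.0), (120.0, 160.0)]
print(drop_corners(gate_corners, dropout_prob=0.1, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across training runs.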

Key Hyperparameters:

history_length: 32 (timesteps)
total_samples: 10M (budget)
simulation_freq: Not explicitly reported in the paper
control_freq: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. PIRLNav: Uses DAgger initialization instead of BC; uses adaptive learning rates to prevent forgetting
vs. DAgger: Adds a third phase of RL fine-tuning to exceed expert performance
vs. RL from scratch: Bootstraps with an imitation-learned policy to overcome the exploration difficulty of learning from high-dimensional visual inputs

Limitations

Requires a privileged expert (oracle) which may not be available for all tasks
Relies on accurate simulation for the initial training phases (sim-to-real gap)
Adaptive tuning heuristic introduces additional hyperparameters (thresholds for rate adjustment)

Reproducibility

Code: https://rpg.ifi.uzh.ch/bootstrap-rl-with-il/index.html

The code URL given in the paper leads to the project page, but the code is not yet released. References to the simulation environments and the BEM model are provided. Hyperparameters for history length and sample budget are specified.

📊 Experiments & Results

Evaluation Setup

Autonomous drone racing in simulation (Flightmare with BEM dynamics) and real-world transfer

Benchmarks:

SplitS Track (Agile Racing)
Figure 8 Track (Agile Racing)
Kidney Track (Agile Racing)

Metrics:

Success Rate (SR)
Lap Time (LT)
Mean Gate Passing Error (MGE)

Statistical methodology: 100 evaluation runs per policy, repeated 5 times with different random seeds. Means reported.
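The aggregation described here can be sketched as a mean of per-seed success rates. The outcome data below is synthetic and exists only to exercise the functions; the helper names `success_rate` and `aggregate_over_seeds` are hypothetical.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of successful runs, given a list of booleans."""
    return sum(outcomes) / len(outcomes)


def aggregate_over_seeds(per_seed_outcomes):
    """Mean of the per-seed success rates (one list of run outcomes per seed)."""
    return mean(success_rate(outcomes) for outcomes in per_seed_outcomes)


# Synthetic example: 5 seeds, 100 runs each, one seed with 10 failures.
successes_per_seed = [100, 90, 100, 100, 100]
per_seed_outcomes = [[True] * k + [False] * (100 - k) for k in successes_per_seed]
print(aggregate_over_seeds(per_seed_outcomes))
```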

Key Results

Simulation results on SplitS track (Corner Input) showing the proposed method outperforms baselines in reliability and speed.

Benchmark	Metric	Baseline	This Paper	Δ
SplitS Track (Sim)	Success Rate (SR)	0.58	1.00	+0.42
SplitS Track (Sim)	Lap Time (LT), s	6.89	6.27	-0.62

Real-world experiments demonstrating sim-to-real transfer capabilities on the SplitS track.

Benchmark	Metric	Baseline	This Paper	Δ
SplitS Track (Real)	Success Rate (SR)	0.40	1.00	+0.60
SplitS Track (Real)	Mean Gate Passing Error (MGE)	0.36	0.19	-0.17
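The Δ column and the corresponding relative changes can be recomputed from the reported values. The helper `summarize_change` is an illustrative name; the numbers are those in the table above.

```python
def summarize_change(baseline, ours):
    """Return (absolute change, relative change in percent), rounded for display."""
    delta = ours - baseline
    relative_percent = 100.0 * delta / baseline
    return round(delta, 2), round(relative_percent, 1)


results = {
    "Sim lap time (s)": (6.89, 6.27),
    "Sim success rate": (0.58, 1.00),
    "Real success rate": (0.40, 1.00),
    "Real gate passing error": (0.36, 0.19),
}
for name, (baseline, ours) in results.items():
    print(name, summarize_change(baseline, ours))
```

The real-world gate passing error drops by about 47%, which matches the "~50%" reduction quoted in the highlights, and the simulated lap time improves by about 9%.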

Experiment Figures

Learning curves (Reward vs. Timesteps) comparing the proposed method against baselines (RL from scratch, Vanilla Finetuning).

Performance vs. Pre-training Data Ratio. Shows deployment reward for different splits of the 10M sample budget between IL and RL.

Main Takeaways

RL from scratch on high-dimensional visual inputs (pixels or corners) completely fails (0% success) within the 10M sample budget.
The proposed three-stage pipeline (RL-Teacher -> IL-Student -> RL-Finetune) is necessary to achieve high success rates and low lap times.
Adaptive fine-tuning prevents catastrophic forgetting, allowing the policy to improve beyond the teacher's capabilities, unlike standard BC or DAgger.
The method demonstrates strong robustness to disturbances (blackouts, wind) compared to pure Imitation Learning baselines.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Imitation Learning (Behavior Cloning, DAgger)
Quadrotor Dynamics

Key Terms

DAgger: Dataset Aggregation—an iterative imitation learning algorithm where the student policy collects its own data, which is then labeled by the expert

PPO: Proximal Policy Optimization, a reinforcement learning algorithm that optimizes policies using a clipped objective function to ensure stable updates

TCN: Temporal Convolutional Network—a neural network architecture that uses 1D convolutions over a time sequence to capture temporal history

Asymmetric Critic: An RL architecture where the critic (value estimator) has access to privileged information (e.g., exact states) that the actor (policy) does not see

Covariate Shift: A situation where the distribution of input data during testing differs from training (e.g., a drone drifting to positions not seen in expert demonstrations)

Sim-to-Real: Transferring a policy trained in simulation to a physical robot

BEM model: Blade Element Momentum theory—a physics model used for accurate aerodynamic simulation of propellers

Privileged Information: Exact state data (position, velocity) available in simulation but not to the vision-based robot during deployment

Catastrophic Forgetting: A phenomenon where a neural network abruptly loses previously learned knowledge when trained on new data
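The Asymmetric Critic entry above can be made concrete with a small sketch in which each stored transition carries both the actor's observation and the privileged state, and only the critic reads the latter. The `Transition` fields, the one-step TD advantage, and the toy critic are assumptions for illustration; the paper's advantage estimator is not specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Transition:
    observation: Sequence[float]            # what the vision-based actor sees
    privileged_state: Sequence[float]       # simulator-only state, critic input
    reward: float
    next_privileged_state: Sequence[float]
    done: bool


def td_advantage(transition: Transition,
                 critic: Callable[[Sequence[float]], float],
                 gamma: float = 0.99) -> float:
    """One-step advantage r + gamma * V(s') - V(s), with V evaluated on privileged states."""
    bootstrap = 0.0 if transition.done else gamma * critic(transition.next_privileged_state)
    return transition.reward + bootstrap - critic(transition.privileged_state)


def toy_critic(state):
    # Hypothetical value function: negative distance to the next gate.
    return -abs(state[0])


step = Transition(observation=[0.1, 0.2], privileged_state=[2.0],
                  reward=0.5, next_privileged_state=[1.0], done=False)
print(td_advantage(step, toy_critic, gamma=0.9))
```

Because the critic is used only during training, the deployed policy still depends on visual observations alone.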