Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight

📝 Paper Summary

Autonomous Drone Racing | Vision-Based Control | Model-Based Reinforcement Learning

Dream to Fly uses DreamerV3 to learn agile drone racing policies that map raw camera pixels directly to control commands, without explicit state estimation or intermediate representations.

Core Problem

Existing vision-based drone racing methods rely on simplified intermediate representations (like gate masks) or extensive imitation learning because model-free RL is too sample-inefficient to learn directly from high-dimensional raw pixels.

Why it matters:

Discarding background information via intermediate representations limits navigation capabilities when gates aren't visible
Bridging the gap to human-level piloting requires processing raw visual cues (texture, horizon) directly
Model-free RL methods like PPO struggle to converge on high-dimensional pixel inputs in reasonable timeframes

Concrete Example: Previous state-of-the-art methods preprocess the camera feed into a binary mask showing only gates. If the drone faces a wall with no gate in view, the binary mask is blank, discarding texture cues like the floor or horizon that a human (or this method) could use for stabilization.

Key Novelty

End-to-End Model-Based RL for Agile Flight

Learns a world model (transition dynamics) in a compact latent space directly from raw RGB images, allowing the agent to imagine future trajectories
Trains the control policy entirely within this learned imagination rather than requiring millions of real-world interactions
Eliminates the need for 'perception-aware' reward shaping; the agent naturally learns to look at gates to minimize uncertainty in its world model
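
The idea of training inside the learned model can be sketched as a latent rollout: starting from a latent state, the actor picks an action, the learned dynamics predict the next latent state, and a reward head scores it, with no simulator or camera in the loop. This is a minimal sketch, not the paper's implementation. The function `imagine_rollout` and the toy actor, dynamics, and reward below are hypothetical stand-ins; only the horizon of 15 comes from the hyperparameters listed in this summary.

```python
from typing import Callable, List, Tuple

State = List[float]
Action = List[float]


def imagine_rollout(
    start_state: State,
    dynamics: Callable[[State, Action], State],
    actor: Callable[[State], Action],
    reward_fn: Callable[[State], float],
    horizon: int = 15,
) -> Tuple[List[State], List[Action], List[float]]:
    """Roll the learned model forward for `horizon` steps in latent space."""
    states: List[State] = [start_state]
    actions: List[Action] = []
    rewards: List[float] = []
    state = start_state
    for _ in range(horizon):
        action = actor(state)            # policy acts on the latent state
        state = dynamics(state, action)  # model predicts the next latent state
        actions.append(action)
        rewards.append(reward_fn(state))  # predicted reward, not a real one
        states.append(state)
    return states, actions, rewards


# Toy stand-ins (hypothetical): a 1D latent that the actor pulls toward zero.
def toy_actor(state: State) -> Action:
    return [-0.5 * state[0]]


def toy_dynamics(state: State, action: Action) -> State:
    return [state[0] + action[0]]


def toy_reward(state: State) -> float:
    return -abs(state[0])


states, actions, rewards = imagine_rollout(
    [1.0], toy_dynamics, toy_actor, toy_reward
)
```

The returned rewards and states are what an actor-critic update would consume; in the real agent the three callables would be learned networks.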

Architecture

The training loop of the DreamerV3 agent involving World Model learning and Actor-Critic learning.

Evaluation Highlights

Reported as the first autonomous agent to fly a quadrotor with a direct pixel-to-command mapping, without intermediate representations or state estimation
Achieves 100% success rate in simulation gate traversal, whereas the PPO baseline fails completely (0% success)
Deploys zero-shot to the real world (trained with domain randomization), achieving agile flight at up to 1.5 m/s

Breakthrough Assessment

9/10

Achieves a long-standing goal in robotics: agile flight from raw pixels without explicit state estimation. The total failure of strong baselines (PPO) highlights the difficulty of the task and the significance of the solution.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) where the state is the history of raw images and the action is a continuous control command

Inputs: Raw RGB images x_k from onboard camera (normalized to [0,1])

Outputs: 4D action vector a_k = [collective thrust, body_rate_x, body_rate_y, body_rate_z]
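
To make the input and output interface concrete, the sketch below scales pixel values into [0, 1] and packs the four command components in the order listed above. It assumes 8-bit camera channels (0 to 255), which this summary does not state; `normalize_pixels` and `CtbrAction` are illustrative names, not identifiers from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Sequence


def normalize_pixels(raw: Sequence[int]) -> List[float]:
    """Scale channel values into [0, 1], assuming 8-bit input (0..255)."""
    return [value / 255.0 for value in raw]


@dataclass(frozen=True)
class CtbrAction:
    """Hypothetical container for the 4D action a_k."""

    collective_thrust: float
    body_rate_x: float
    body_rate_y: float
    body_rate_z: float

    def as_vector(self) -> List[float]:
        """Return the components in the order used in this summary."""
        return [
            self.collective_thrust,
            self.body_rate_x,
            self.body_rate_y,
            self.body_rate_z,
        ]
```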

Pipeline Flow

World Model Learning: Encoder → RSSM → Decoder/Predictor
Policy Learning: Latent State → Actor → Imagined Trajectory → Critic
Deployment: Camera → Encoder → RSSM → Actor → Control Commands
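
The deployment flow above can be read as a single function that is called once per camera frame. The sketch below shows that wiring under stated assumptions: `run_control_step` and the three toy components are hypothetical, and the real encoder, RSSM, and actor are learned networks rather than these one-line stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_control_step(
    image: Sequence[float],
    prev_state: Vector,
    prev_action: Vector,
    encoder: Callable[[Sequence[float]], Vector],
    rssm_step: Callable[[Vector, Vector, Vector], Vector],
    actor: Callable[[Vector], Vector],
) -> Tuple[Vector, Vector]:
    """One Camera -> Encoder -> RSSM -> Actor pass; returns (state, action)."""
    embedding = encoder(image)
    state = rssm_step(prev_state, prev_action, embedding)
    action = actor(state)
    return state, action


# Toy stand-ins (hypothetical) so the wiring can be exercised end to end.
def toy_encoder(image: Sequence[float]) -> Vector:
    return [sum(image) / len(image)]


def toy_rssm_step(prev_state: Vector, prev_action: Vector, emb: Vector) -> Vector:
    return [0.5 * prev_state[0] + 0.5 * emb[0]]


def toy_actor(state: Vector) -> Vector:
    return [state[0], 0.0, 0.0, 0.0]


state, action = run_control_step(
    [0.2, 0.4], [0.0], [0.0, 0.0, 0.0, 0.0],
    toy_encoder, toy_rssm_step, toy_actor,
)
```

Carrying `state` and `action` into the next call is what gives the policy memory across frames; no decoder or critic is needed at flight time.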

System Modules

Encoder

Compress high-dimensional raw images into a stochastic latent representation

Model or implementation: Convolutional Neural Network (CNN)

Recurrent State Space Model (RSSM)

Predict the evolution of the latent state based on previous states and actions

Model or implementation: Recurrent Neural Network (GRU-based)
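
As a rough picture of what a GRU-based recurrent update does, the sketch below implements one textbook GRU step in plain Python. It is a simplification under stated assumptions: biases and any normalization layers are omitted, the input `x` stands for whatever is fed to the recurrence (for example the previous stochastic latent concatenated with the previous action), and the function and weight names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(h: Vector, x: Vector, w_z: Matrix, w_r: Matrix, w_n: Matrix) -> Vector:
    """One GRU update; each weight matrix has shape len(h) x (len(x) + len(h))."""
    xh = x + h
    z = [sigmoid(v) for v in matvec(w_z, xh)]  # update gate
    r = [sigmoid(v) for v in matvec(w_r, xh)]  # reset gate
    gated = x + [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in matvec(w_n, gated)]  # candidate state
    # Blend old state and candidate according to the update gate.
    return [(1.0 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so the step simply halves the hidden state, which is a convenient sanity check.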

Actor

Generate control actions based on the current latent state

Model or implementation: Multi-Layer Perceptron (MLP)
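
A common way to turn unbounded network outputs into commands a flight controller will accept is to squash them with tanh and rescale to per-channel limits. The sketch below shows that step only. It assumes the actor's raw outputs are unbounded, which this summary does not confirm, and the limits in the usage note are made-up placeholders, not values from the paper.

```python
import math
from typing import List, Sequence


def squash_and_scale(
    raw: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Map each raw output through tanh, then linearly into [low, high]."""
    scaled = []
    for u, lo, hi in zip(raw, low, high):
        s = math.tanh(u)  # bounded to [-1, 1]
        scaled.append(lo + 0.5 * (s + 1.0) * (hi - lo))
    return scaled
```

For example, with hypothetical limits `low = [0, -6, -6, -3]` and `high = [20, 6, 6, 3]`, a raw output of all zeros maps to the midpoint of each range.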

Novel Architectural Elements

Direct mapping from RSSM latent states to CTBR (Collective Thrust Body Rates) commands for agile flight, bypassing position/velocity abstractions
Integration of DreamerV3 architecture into a real-time quadrotor control loop

Modeling

Base Model: DreamerV3 (Small configuration)

Training Method: Model-Based Reinforcement Learning (DreamerV3)

Objective Functions:

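Purpose: Train the world model to reconstruct observations and to predict rewards and episode continuation.

Formally: L_WM = E[ -ln p(x_t|s_t) - ln p(r_t|s_t) - ln p(c_t|s_t) + KL(q(z_t|h_t, x_t) || p(z_t|h_t)) ], where x_t is the image, r_t the reward, c_t the continuation flag, h_t the recurrent state, z_t the stochastic latent, and s_t = (h_t, z_t). The KL term pulls the prior p(z_t|h_t), which does not see the image, toward the posterior q(z_t|h_t, x_t), which does.

Purpose: Train the actor to maximize imagined returns.

Formally: Maximize E[ sum(gamma^t * r_t) ] using learned value estimates

Purpose: Train the critic to estimate the value of states.

Formally: Minimize regression loss between predicted value and lambda-return targets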
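
The lambda-return targets used for the critic can be computed with a short backward recursion over an imagined trajectory. The sketch below uses the common form R_t = r_t + gamma * c_t * ((1 - lambda) * v_{t+1} + lambda * R_{t+1}), bootstrapped from the last value estimate. The default `lam=0.95` and the exact time indexing are assumptions, since this summary does not give them; `gamma=0.997` matches the discount factor listed under the hyperparameters. The plain squared error is a stand-in for whatever regression loss the critic actually uses.

```python
from typing import List, Sequence


def lambda_returns(
    rewards: Sequence[float],
    values: Sequence[float],
    continues: Sequence[float],
    gamma: float = 0.997,
    lam: float = 0.95,
) -> List[float]:
    """Backward recursion; `values` has one more entry than `rewards`."""
    horizon = len(rewards)
    returns = [0.0] * horizon
    running = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(horizon)):
        blended = (1.0 - lam) * values[t + 1] + lam * running
        running = rewards[t] + gamma * continues[t] * blended
        returns[t] = running
    return returns


def squared_error(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between critic predictions and return targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)
```

Setting `lam=0` reduces each target to a one-step bootstrap, while `lam=1` gives the full discounted sum plus the final bootstrap value; values in between trade bias against variance.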

Training Data:

Simulation environment: Flightmare
Randomly generated tracks with 7 gates
Domain randomization on textures, lighting, and physics parameters for Sim2Real
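
Domain randomization of this kind usually amounts to drawing a fresh set of visual and physical parameters at the start of each episode. The sketch below illustrates the pattern only: the parameter names, the number of textures, and every range are hypothetical placeholders, since this summary does not list the actual randomized quantities or their bounds.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeRandomization:
    """Hypothetical per-episode draw covering textures, lighting, and physics."""

    texture_id: int
    light_intensity: float
    mass_scale: float


def sample_randomization(
    rng: random.Random, num_textures: int = 10
) -> EpisodeRandomization:
    """Draw one randomization; all ranges here are illustrative assumptions."""
    return EpisodeRandomization(
        texture_id=rng.randrange(num_textures),
        light_intensity=rng.uniform(0.5, 1.5),
        mass_scale=rng.uniform(0.9, 1.1),
    )
```

Passing an explicit `random.Random` instance keeps the draws reproducible per seed, which helps when comparing runs.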

Key Hyperparameters:

batch_size: 16
batch_length: 64
learning_rate: 1e-4
buffer_size: 1e6
discount_factor: 0.997
imagination_horizon: 15
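
A few quantities follow directly from these values and help with intuition. The sketch below collects the listed hyperparameters in a plain dictionary (the key names simply mirror the list above, not any particular config file) and derives the number of timesteps per training batch, the effective horizon 1 / (1 - gamma), and how much a reward at the end of an imagined rollout is discounted.

```python
HPARAMS = {
    "batch_size": 16,
    "batch_length": 64,
    "learning_rate": 1e-4,
    "buffer_size": 1_000_000,
    "discount_factor": 0.997,
    "imagination_horizon": 15,
}


def timesteps_per_batch(hp: dict) -> int:
    """Sequences per batch times steps per sequence."""
    return hp["batch_size"] * hp["batch_length"]


def effective_horizon(gamma: float) -> float:
    """Rough number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


def discount_at_horizon(hp: dict) -> float:
    """Weight of a reward at the last imagined step."""
    return hp["discount_factor"] ** hp["imagination_horizon"]
```

With these values a batch holds 1024 timesteps and the effective horizon is roughly 333 steps, far longer than the 15 imagined steps, which is why the critic's bootstrap value matters at the end of each rollout.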

Compute: Single NVIDIA RTX 2080 Ti or RTX 3090 GPU. Training takes approximately 4-6 hours for 2M steps.

Comparison to Prior Work

vs. PPO: Uses learned world model for sample efficiency vs. pure trial-and-error [PPO fails completely on pixels]
vs. Intermediate Rep: Learns from raw RGB vs. simplified binary masks, retaining background context
vs. Imitation Learning: Learns from scratch via RL vs. requiring expert pilot data

Limitations

Computational overhead of running the world model (RSSM) onboard is higher than simple CNN policies
Sim-to-real gap still requires extensive domain randomization
Current real-world speeds (1.5 m/s) are lower than state-based expert systems

Reproducibility

Code: https://github.com/uzh-rpg/dream_to_fly

The code is publicly available at the repository linked above, and the Flightmare simulator is open source. The real-world drone hardware (Agilicious) is custom but documented in prior work.

📊 Experiments & Results

Evaluation Setup

Drone racing through a sequence of 7 gates in simulation (Flightmare) and real-world (Agilicious quadrotor)

Benchmarks:

Flightmare Simulation (Gate traversal / Racing)
Real-World Flight (Gate traversal / Racing) [New]

Metrics:

Success Rate (SR)
Lap Time
Crash Rate
Statistical methodology: Results averaged over 5 seeds in simulation
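
Averaging over seeds can be sketched as computing a success rate per seed and then reporting the mean and spread across seeds. The helper names below are hypothetical, and any outcomes passed in are the caller's own data; nothing here reproduces results from the paper.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of evaluation episodes that completed the track."""
    return sum(outcomes) / len(outcomes)


def aggregate_over_seeds(
    per_seed_outcomes: Sequence[Sequence[bool]],
) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed success rates."""
    rates: List[float] = [success_rate(o) for o in per_seed_outcomes]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread
```

Reporting the spread alongside the mean makes it easier to tell a consistently successful policy from one that succeeds only on some seeds.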

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Simulation comparisons showing the inability of model-free RL (PPO) to learn from pixels compared to the proposed model-based approach.
Flightmare Simulation	Success Rate	0	1.0	+1.0
Flightmare Simulation	Success Rate	1.0	1.0	0.0
Ablation on observation types confirming raw pixels are viable compared to simplified masks.
Flightmare Simulation	Lap Time (s)	5.5	5.8	+0.3

Experiment Figures

Learning curves (Success Rate vs. Environment Steps) for DreamerV3 (Pixels), PPO (Pixels), DreamerV3 (State), and PPO (State).

Saliency maps of the agent's visual attention during flight.

Main Takeaways

Model-free RL (PPO) fails to learn the racing task directly from pixels (0% success), which supports the case for model-based approaches.
The learned policy exhibits emergent 'active perception' behaviors, orienting the camera toward gates without explicit reward shaping.
Zero-shot sim-to-real transfer is successful, enabling the drone to fly through gates in the real world using only onboard camera processing.
Intermediate representations (like masks) simplify learning but are not strictly necessary with DreamerV3, which can handle raw RGB complexity.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Policy Gradients)
Model-Based RL (World Models, latent dynamics)
Quadrotor dynamics and control (CTBR - Collective Thrust and Body Rates)

Key Terms

DreamerV3: A model-based reinforcement learning algorithm that learns a world model from sensory inputs and trains a policy using imagined trajectories within that model

RSSM: Recurrent State Space Model—a neural network architecture used to model the dynamics of the environment by predicting future states

CTBR: Collective Thrust and Body Rates—a low-level control interface used by expert pilots and this paper, controlling total thrust and rotational velocities directly

World Model: A learned internal simulation of the environment's dynamics, allowing the agent to predict the consequences of its actions

PPO: Proximal Policy Optimization—a popular model-free reinforcement learning algorithm used as a baseline here

MBRL: Model-Based Reinforcement Learning—RL methods that learn a model of the environment to improve sample efficiency

Zero-shot transfer: Evaluating a model in a new environment (real world) without any additional training after being trained in a different environment (simulation)