Model-Based RL: RL methods that learn a model of the environment's dynamics (transitions and rewards) and often use it for planning
Model-Free RL: RL methods that learn a policy or value function directly from experience, without explicitly modeling the environment's dynamics
TD3: Twin Delayed DDPG—a standard model-free algorithm for continuous control that uses two critics to reduce overestimation bias
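The twin-critic idea can be sketched as a clipped double-Q target: bootstrap from the minimum of the two critics' next-state estimates so that overestimation in either critic is damped. The function and argument names below are illustrative, not TD3's actual API:

```python
def twin_target(reward, gamma, q1_next, q2_next, done):
    """Clipped double-Q target: bootstrap from the smaller of two
    critic estimates to reduce overestimation bias (as in TD3)."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)
```

For example, with reward 1.0, discount 0.99, and critic estimates 2.0 and 3.0, the target uses 2.0, giving 1.0 + 0.99 * 2.0 = 2.98.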
DreamerV3: A state-of-the-art general-purpose model-based RL algorithm that learns a world model from pixels
TD-MPC2: A general-purpose model-based algorithm that uses temporal difference learning for model predictive control
Bisimulation: A mathematical concept where two states are considered equivalent if they have the same immediate reward and transition to equivalent next states
Two-hot encoding: A categorical representation of a scalar value (like reward) using probability mass distributed between the two nearest discrete bins
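A minimal sketch of two-hot encoding, assuming a fixed, sorted array of bin centers: the scalar's probability mass is split between its two neighboring bins in proportion to its distance from each, so the expected bin value recovers the scalar exactly.

```python
import numpy as np

def two_hot(value, bins):
    """Encode a scalar as probability mass on the two nearest bin
    centers; `bins` is a sorted 1-D array of bin centers (assumed)."""
    value = float(np.clip(value, bins[0], bins[-1]))
    idx = int(np.searchsorted(bins, value, side="right")) - 1
    idx = min(idx, len(bins) - 2)          # keep a valid upper neighbor
    lo, hi = bins[idx], bins[idx + 1]
    weight_hi = (value - lo) / (hi - lo)   # linear interpolation weight
    encoding = np.zeros(len(bins))
    encoding[idx] = 1.0 - weight_hi
    encoding[idx + 1] = weight_hi
    return encoding
```

With bins [0, 1, 2, 3], the value 1.25 encodes as [0, 0.75, 0.25, 0], and the dot product with the bin centers gives back 1.25.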
Huber loss: A loss function that is quadratic for small errors and linear for large errors, less sensitive to outliers than Mean Squared Error
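The piecewise behavior of the Huber loss can be written compactly by splitting the absolute error at the threshold delta; this is the standard formulation, with delta as a tunable parameter:

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: the two pieces
    meet smoothly at |error| = delta."""
    abs_err = np.abs(error)
    quadratic = np.minimum(abs_err, delta)   # capped at delta
    linear = abs_err - quadratic             # overflow past delta
    return 0.5 * quadratic ** 2 + delta * linear
```

A small error of 0.5 gives the MSE-like value 0.125, while a large error of 3.0 grows only linearly, giving 2.5 instead of MSE's 4.5.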
LAP: Loss-Adjusted Prioritized Experience Replay—a method that samples training data with probability proportional to the magnitude of the TD error, adjusted to pair with the Huber loss
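A sketch of TD-error-proportional sampling in the spirit of LAP, assuming its convention of clipping priorities below 1 (so transitions with small errors are sampled uniformly rather than starved); the function name and exponent value are illustrative:

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.4, rng=None):
    """Sample transition indices with probability proportional to a
    clipped power of the absolute TD error (min priority 1, an
    assumption matching LAP's convention)."""
    rng = rng or np.random.default_rng()
    priorities = np.maximum(np.abs(td_errors), 1.0) ** alpha
    probs = priorities / priorities.sum()
    return rng.choice(len(td_errors), size=batch_size, p=probs)
```

Transitions with large TD errors are drawn more often, focusing updates on poorly fit regions of the replay buffer.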
EMA: Exponential Moving Average—a technique to update target network parameters slowly over time to stabilize training
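The EMA target update is a one-line interpolation between the target and online parameters; tau is the update rate, with small values (e.g. 0.005) giving a slowly moving target:

```python
def ema_update(target_params, online_params, tau=0.005):
    """Soft target update: theta_target <- (1 - tau) * theta_target
    + tau * theta_online, applied parameter-wise."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

With tau = 0.005, the target network tracks the online network with an effective averaging window of roughly 1/tau = 200 updates, which smooths out the noisy bootstrap targets.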