SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, K. Subramanian, Peter R. Wurman, J. Choo, Peter Stone, Takuma Seno
Coventry University
International Conference on Learning Representations (2025)
RL Benchmark

📝 Paper Summary

Network Architecture for RL · Scaling Laws in RL
SimBa is a neural network architecture that enables scaling up Deep Reinforcement Learning models to millions of parameters without overfitting by explicitly enforcing a bias toward simpler functions.
Core Problem
Increasing the size of neural networks in deep reinforcement learning (RL) typically degrades performance due to overfitting, unlike in computer vision or NLP, where larger models generally perform better.
Why it matters:
  • Current RL methods fail to leverage the scaling laws that have driven breakthroughs in other fields (e.g., LLMs), limiting the complexity of behaviors agents can learn.
  • Standard large networks (MLPs) fit noise in the RL training data rather than generalizable patterns, causing training to collapse as parameter count increases.
  • Existing scaling attempts often rely on computationally expensive components (like spectral normalization) or complex training protocols, making them inefficient.
Concrete Example: When scaling a Soft Actor-Critic (SAC) agent from 0.1M to 17M parameters on the 'Humanoid' task, the standard MLP architecture's performance drops significantly. In contrast, SimBa's performance improves as the model size increases.
Key Novelty
Architectural induction of Simplicity Bias
  • Uses a specific arrangement of normalization and residual connections to ensure the network prefers 'simple' (low-frequency) functions at initialization.
  • Maintains a direct linear path from input to output, adding non-linearity only via residual blocks, which encourages the model to ignore noise and focus on dominant features.
  • Does not require new loss functions or training algorithms; it is a drop-in architectural replacement for standard MLPs.
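The architectural recipe above (a normalized input, a linear embedding, residual blocks that add non-linearity around an otherwise linear path, and a final normalization) can be sketched in plain numpy. This is a minimal illustration, not the authors' implementation: the class names, widths, and initializations are my own, and the paper's running-statistics observation normalization is approximated here with an ordinary per-sample LayerNorm.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the feature dimension to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class SimBaBlock:
    """One residual block: pre-LayerNorm -> Linear -> ReLU -> Linear,
    added back onto the input, so the skip path stays purely linear."""
    def __init__(self, dim, hidden, rng):
        self.w1 = rng.standard_normal((dim, hidden)) * np.sqrt(2.0 / dim)
        self.w2 = rng.standard_normal((hidden, dim)) * np.sqrt(2.0 / hidden)

    def __call__(self, x):
        h = np.maximum(layer_norm(x) @ self.w1, 0.0)  # LN -> Linear -> ReLU
        return x + h @ self.w2                        # residual add

class SimBaEncoder:
    """Input norm -> linear embed -> N residual blocks -> final LayerNorm.
    Widths and depth are illustrative, not the paper's settings."""
    def __init__(self, obs_dim, dim=64, hidden=256, depth=2, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.standard_normal((obs_dim, dim)) * np.sqrt(1.0 / obs_dim)
        self.blocks = [SimBaBlock(dim, hidden, rng) for _ in range(depth)]

    def __call__(self, obs):
        # Stand-in for the paper's running-statistics observation norm.
        x = layer_norm(obs) @ self.embed
        for blk in self.blocks:
            x = blk(x)
        return layer_norm(x)

enc = SimBaEncoder(obs_dim=17)  # e.g. a small proprioceptive observation
feats = enc(np.random.default_rng(1).standard_normal((8, 17)))
print(feats.shape)  # (8, 64)
```

Because every non-linearity lives inside a residual branch, zeroing those branches leaves a plain linear map from input to output, which is one way to see the "prefers simple functions at initialization" claim; the encoder then drops in wherever a standard MLP trunk would go.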
Evaluation Highlights
  • SimBa integrated into SAC matches or surpasses state-of-the-art methods across 51 tasks in DMC, MyoSuite, and HumanoidBench.
  • Scaling parameters from 0.1M to 17M consistently improves performance with SimBa, whereas standard MLPs degrade.
  • Achieves these results without computationally intensive components like self-supervised objectives, planning, or replay ratio scaling.
Breakthrough Assessment
8/10
Directly addresses the long-standing scaling problem in RL, where bigger networks hurt performance, with a simple architectural fix that applies broadly across RL algorithms.