Offline RL: Reinforcement learning that learns a policy from a fixed dataset without interacting with the environment
TD learning: Temporal Difference learning—a method to estimate value functions by bootstrapping from current estimates
Horizon: The number of steps required to reach a goal or the length of the decision-making sequence
n-step returns: Calculating returns by summing rewards over n steps before bootstrapping, reducing the number of bootstrapping steps (and thus bias accumulation)
Goal-conditioned RL: RL where the agent must learn to reach various goal states specified as input
Hierarchical RL: Decomposing a complex task into high-level subgoals and low-level actions to simplify learning
SHARSA: The proposed method; combines n-step SARSA (for value learning) with hierarchical behavioral cloning (for policy learning)
Bias accumulation: The compounding of small errors in value estimation at each step of TD learning, which grows with the horizon length
IQL: Implicit Q-Learning—an offline RL method that avoids querying out-of-sample actions by using expectile regression
SAC+BC: Soft Actor-Critic with Behavioral Cloning regularization—a standard offline RL baseline
CRL: Contrastive RL—a method that learns goal-conditioned value estimates via contrastive representation learning
Flow BC: Flow Behavioral Cloning—behavioral cloning that models the policy as a conditional flow-based generative model
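The connection between n-step returns, bootstrapping, and bias accumulation can be made concrete with a small sketch. This is an illustrative example, not code from the paper: `n_step_sarsa_target` is a hypothetical helper computing the target used in n-step SARSA-style value learning. The key point is that the value estimate is only bootstrapped once per n environment steps, so over a horizon of H steps the estimation error compounds roughly H/n times instead of H times.

```python
# Illustrative sketch (not from the paper): the target for an n-step
# SARSA-style update of Q(s_t, a_t).

def n_step_sarsa_target(rewards, q_boot, gamma, n):
    """n-step return: sum of n discounted rewards, then a single
    discounted bootstrap from the current value estimate.

    rewards: the n observed rewards r_t, ..., r_{t+n-1}
    q_boot:  current estimate Q(s_{t+n}, a_{t+n}) (the bootstrap term)
    gamma:   discount factor
    n:       number of real-reward steps before bootstrapping
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return g + (gamma ** n) * q_boot


# With n = 1 this reduces to the ordinary one-step TD (SARSA) target
# r_t + gamma * Q(s_{t+1}, a_{t+1}); larger n trades added return
# variance for fewer bootstrap steps, i.e. less bias accumulation.
print(n_step_sarsa_target([1.0, 0.0], 4.0, 0.5, 2))  # 1 + 0 + 0.25*4 = 2.0
```

With gamma = 1 and n equal to the remaining horizon, the bootstrap term's weight stays but the number of bootstrapped estimates in the chain shrinks to one, which is the horizon-reduction effect the glossary's "bias accumulation" entry refers to.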