Continuous Control with Coarse-to-fine Reinforcement Learning

📝 Paper Summary

Continuous Control, Robotic Manipulation, Action Discretization

CQN enables stable, sample-efficient value-based reinforcement learning in continuous action spaces by iteratively zooming into the action space through multi-level discretization.

Core Problem

Applying RL to real-world robotics is difficult because actor-critic methods are unstable and sample-inefficient, while value-based methods struggle with the trade-off between action precision and the curse of dimensionality in continuous spaces.

Why it matters:

Real-world robots require sample efficiency due to hardware wear and reset costs, making data-hungry on-policy methods impractical
Standard discretization limits precision: fine grids explode the action space (hard to learn), while coarse grids lack the dexterity needed for manipulation

Concrete Example: In a robotic manipulation task requiring high precision (e.g., inserting a plug), a standard discrete agent with few bins misses the target, while one with many bins takes too long to explore and learn. CQN starts coarse to find the general area, then zooms in for precision.

Key Novelty

Coarse-to-fine Q-Network (CQN)

Iterative Zooming: Instead of outputting one high-precision action directly, the agent discretizes each action dimension into a few bins, selects the best bin, and then re-discretizes that selected interval at the next level
Multi-level Critic: A single value-based network structure that takes the previous level's decisions as input to inform the next level's finer-grained choice
Efficient Precision: Achieves high continuous control precision with very few bins per level (e.g., 3 bins) by chaining multiple levels, avoiding the exponential explosion of actions
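The zooming idea above can be sketched for a single action dimension. This is a minimal illustration, not the authors' implementation: `coarse_to_fine_select` and `score_fn` are hypothetical names, `score_fn` stands in for a learned Q-value, and the choice of 3 levels is an assumption made only for the example.

```python
def coarse_to_fine_select(score_fn, low, high, bins=3, levels=3):
    """Zoom into [low, high] by repeatedly picking the best of `bins` sub-intervals.

    score_fn(center) is a hypothetical stand-in for a learned Q-value
    evaluated at the centre of a candidate interval.
    """
    for _ in range(levels):
        width = (high - low) / bins
        centers = [low + (i + 0.5) * width for i in range(bins)]
        # Greedy choice: keep the sub-interval whose centre scores highest.
        best = max(range(bins), key=lambda i: score_fn(centers[i]))
        low, high = low + best * width, low + (best + 1) * width
    # The executed continuous action is the centre of the final interval.
    return (low + high) / 2.0


# Toy usage: the "ideal" action is 0.37 inside the range [-1, 1].
target = 0.37
action = coarse_to_fine_select(lambda a: -(a - target) ** 2, -1.0, 1.0)
print(action)
```

With 3 bins and 3 levels the final interval has width 2/27, so in this toy setting the returned action lies within 1/27 of the target even though only 3 candidates are compared at any one time.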

Architecture

Overview of the Coarse-to-fine Reinforcement Learning (CRL) framework and CQN architecture.

Evaluation Highlights

Outperforms RL and BC baselines on 20 sparsely-rewarded RLBench tasks using only 100 demonstrations and 100k online interactions
Achieves performance competitive with a state-of-the-art actor-critic baseline (DrQ-v2) on DeepMind Control Suite tasks
Robustly solves real-world manipulation tasks (e.g., stacking blocks) within minutes of online training

Breakthrough Assessment

8/10

Significantly improves sample efficiency for precise continuous control by cleverly adapting value-based methods. The removal of the actor network simplifies the architecture while maintaining high precision.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) in continuous action space

Inputs: Pixel observations o_t and proprioceptive states

Outputs: Continuous action vector a_t (constructed from multi-level discrete choices)

Pipeline Flow

Visual Encoder (processes pixels)
Coarse-to-fine Inference Loop (Levels 1 to L)
Action Execution

System Modules

Visual Encoder

Encodes pixel observations into low-dimensional feature vectors

Model or implementation: 4-layer CNN with layer normalization

Coarse-to-fine Critic (Level 1 to L)

Iteratively selects action intervals to zoom into

Model or implementation: Factorized Q-networks (shared parameters across levels/dimensions)

Novel Architectural Elements

Recursive Q-network structure where level l takes the full action vector from level l-1 as input
Parameter sharing across all levels and action dimensions (conditioned on level index and dimension index)
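A rough sketch of how such a level-conditioned inference loop could look for a multi-dimensional action is given below. It assumes a hypothetical critic interface `q_fn(features, prev_action, level, dim)` that returns one Q-value per bin; the interface, the toy critic, and the 3-level default are illustrative assumptions rather than the paper's code.

```python
def cqn_style_inference(q_fn, features, low, high, bins=3, levels=3):
    """Coarse-to-fine action selection over several action dimensions.

    q_fn(features, prev_action, level, dim) -> list of `bins` Q-values.
    This signature is hypothetical; it mirrors the idea that one shared
    critic is conditioned on the level index and the dimension index.
    """
    dims = len(low)
    low, high = list(low), list(high)
    # Before the first level, the "previous action" is the centre of the range.
    prev_action = [(lo + hi) / 2.0 for lo, hi in zip(low, high)]
    for level in range(levels):
        new_action = []
        for d in range(dims):
            # Every dimension sees the full action vector from the previous level.
            q_values = q_fn(features, prev_action, level, d)
            best = max(range(bins), key=lambda i: q_values[i])
            width = (high[d] - low[d]) / bins
            low[d], high[d] = low[d] + best * width, low[d] + (best + 1) * width
            new_action.append((low[d] + high[d]) / 2.0)
        prev_action = new_action
    return prev_action


def toy_q_fn(features, prev_action, level, dim, bins=3, full_width=2.0):
    """Toy critic (assumption): features directly encode the ideal action."""
    target = features[dim]
    sub = full_width / bins ** (level + 1)  # width of one bin at this level
    centers = [prev_action[dim] + (i - (bins - 1) / 2.0) * sub for i in range(bins)]
    return [-(c - target) ** 2 for c in centers]


ideal = [0.37, -0.8]
chosen = cqn_style_inference(toy_q_fn, ideal, low=[-1.0, -1.0], high=[1.0, 1.0])
print(chosen)
```

Note that the toy critic can reconstruct the candidate bin centres from the previous action and the level index alone, which is why passing those two pieces of information to a shared network is sufficient to identify the interval being refined.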

Modeling

Base Model: Custom CNN + MLP architecture

Training Method: Value-based RL (Distributional Q-learning) with auxiliary BC loss

Objective Functions:

Purpose: Minimize distributional temporal difference error.
Formally: Cross-entropy between the predicted return distribution and the target distribution (C51 algorithm).

Purpose: Encourage expert-like actions (Behavior Cloning).
Formally: Margin loss encouraging Q(expert) > Q(other) + margin.

Purpose: Self-imitation.
Formally: Treating successful online trajectories as demonstrations.
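The margin objective can be illustrated with one common form of large-margin loss over discrete bins. This is a sketch under assumptions: the exact formulation used in the paper is not given in this summary, and the function name and the margin value are hypothetical.

```python
def large_margin_loss(q_values, expert_index, margin=0.01):
    """Large-margin loss for one state and one discretized action dimension.

    Every non-expert bin gets `margin` added to its Q-value; the loss is the
    gap between the best augmented value and the expert's Q-value. It is zero
    only when Q(expert) >= Q(other) + margin for all other bins.
    The margin value here is an illustrative assumption.
    """
    augmented = [
        q + (0.0 if i == expert_index else margin)
        for i, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_index]


q = [1.0, 0.5, 0.2]
print(large_margin_loss(q, expert_index=0, margin=0.1))  # expert already best by a margin
print(large_margin_loss(q, expert_index=1, margin=0.1))  # expert bin is under-valued
```

The loss is non-negative by construction, because the expert's own (un-augmented) Q-value is always one of the candidates in the maximum.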

Training Data:

Replay buffer mixing online experiences and expert demonstrations (50/50 ratio)
Modest number of demonstrations (e.g., 100)
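A 50/50 batch mix of this kind could be drawn as follows. This is a simplified sketch: the buffers are plain Python lists, sampling is uniform with replacement, and `sample_mixed_batch` is a hypothetical helper, not a function from the released code.

```python
import random


def sample_mixed_batch(online_buffer, demo_buffer, batch_size=512, rng=random):
    """Draw half of the batch from online data and half from demonstrations.

    Uniform sampling with replacement is a simplification made for this sketch.
    """
    half = batch_size // 2
    online = [rng.choice(online_buffer) for _ in range(half)]
    demo = [rng.choice(demo_buffer) for _ in range(batch_size - half)]
    return online + demo


# Toy buffers: each transition is tagged with its source.
online_buffer = [("online", i) for i in range(1000)]
demo_buffer = [("demo", i) for i in range(100)]
batch = sample_mixed_batch(online_buffer, demo_buffer)
print(len(batch), sum(1 for source, _ in batch if source == "demo"))
```

Keeping the demonstration share fixed means expert transitions keep appearing in every update even as the online buffer grows much larger than the demonstration set.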

Key Hyperparameters:

optimizer: AdamW
weight_decay: 0.1
batch_size: 512 (256 online + 256 demo)
atoms: 51 (for distributional critic)
noise_std: 0.01 (exploration)
bins_per_level_B: 3 (typically)
levels_L: value not stated in the summarized text; treated as a tunable hyperparameter
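For reference, the listed values could be collected into a small configuration object such as the one below. The class and field names are hypothetical, and the number of levels is deliberately left unset because this summary does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CQNConfig:
    """Hyperparameters as listed in this summary (field names are hypothetical)."""
    optimizer: str = "AdamW"
    weight_decay: float = 0.1
    batch_size: int = 512        # 256 online + 256 demo
    demo_batch_size: int = 256
    num_atoms: int = 51          # distributional critic
    noise_std: float = 0.01      # exploration noise
    bins_per_level: int = 3
    num_levels: Optional[int] = None  # not stated in this summary; must be set by the user

    def effective_bins(self) -> int:
        """Number of distinct values per action dimension after all levels."""
        if self.num_levels is None:
            raise ValueError("num_levels must be set before computing resolution")
        return self.bins_per_level ** self.num_levels


config = CQNConfig()
print(config)
```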

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. DrQ-v2: CQN is value-based (critic-only) and uses coarse-to-fine discretization rather than continuous actor gradients
vs. C2F-ARM: CQN applies to general continuous control (e.g., joint-level control), whereas C2F-ARM is specific to next-best-pose (end-effector) control
vs. Single-level Discretization [not cited in paper]: CQN uses multi-level zooming to achieve precision with few bins, whereas single-level methods require huge action spaces for precision
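The trade-off against single-level discretization can be made concrete with a little arithmetic. The sketch below compares, per action dimension, how many Q-values must be produced to reach the same resolution; the 5-level setting is an assumption chosen for illustration, and the function names are hypothetical.

```python
def single_level_outputs(bins_per_level, levels):
    """Bins a single-level grid needs to match the multi-level resolution."""
    return bins_per_level ** levels


def coarse_to_fine_outputs(bins_per_level, levels):
    """Total Q-values evaluated per dimension across all zoom levels."""
    return bins_per_level * levels


def resolution(low, high, bins_per_level, levels):
    """Width of the final interval for one action dimension."""
    return (high - low) / bins_per_level ** levels


B, L = 3, 5  # L = 5 is an illustrative assumption
print(single_level_outputs(B, L))    # 243 bins in one flat grid
print(coarse_to_fine_outputs(B, L))  # 15 Q-values across 5 levels
print(resolution(-1.0, 1.0, B, L))   # roughly 0.0082 in both cases
```

The flat grid grows exponentially in the number of levels while the coarse-to-fine count grows linearly, which matches the linear inference-time cost noted for this kind of scheme.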

Limitations

Inference time scales linearly with the number of levels L
Requires defining bounds for the action space beforehand
Performance depends on the quality of demonstrations for the auxiliary loss

Reproducibility

Code: https://younggyo.me/cqn

The method relies on standard RL components (CNNs, MLPs, replay buffers). Hyperparameters such as exploration noise and the online/demo batch mix are specified.

📊 Experiments & Results

Evaluation Setup

Visuomotor control tasks in simulation and real-world

Benchmarks:

RLBench: robotic manipulation with sparse rewards
DeepMind Control Suite (DMC): continuous control (locomotion/manipulation)

Metrics:

Success Rate
Episode Return
Statistical methodology: Not explicitly reported in the paper

Key Results

CQN significantly outperforms baselines on sparsely rewarded RLBench tasks, demonstrating sample efficiency.
CQN is competitive with state-of-the-art actor-critic methods on standard continuous control benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
RLBench (20 tasks average)	Success Rate	Not reported in the paper	Not reported in the paper	-
DMC	Performance	Competitive	Competitive	-

Main Takeaways

CQN enables high-precision control with very few bins (e.g., 3) by using multiple levels, solving the trade-off between precision and action space size.
The method is robust to the absence of standard robotics aids like motion planning, camera calibration, or depth sensors.
Auxiliary objectives (Behavior Cloning and Self-Imitation) are crucial for sample efficiency in sparse reward settings.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Q-learning, Actor-Critic)
Discretization of continuous spaces
Robotic manipulation basics

Key Terms

CRL: Coarse-to-fine Reinforcement Learning—a framework where agents iteratively zoom into continuous action spaces

CQN: Coarse-to-fine Q-Network—the specific value-based algorithm implementation of the CRL framework

Actor-Critic: RL architecture with separate policy (actor) and value (critic) networks; often unstable in continuous control

Value-based RL: RL methods (like DQN) that learn Q-values for actions and select the best one, typically more stable but naturally discrete

BC: Behavior Cloning—supervised learning from expert demonstrations

C2F-ARM: A prior coarse-to-fine method specific to next-best-pose agents; CQN is a generalization of this to continuous joint control

Distributional Critic: A critic that predicts the full distribution of returns (probabilities over a fixed set of value atoms) rather than only the expected return
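As a small illustration of a categorical (C51-style) distributional critic, the scalar Q-value used for action selection is the probability-weighted sum of evenly spaced atoms. The support bounds `v_min` and `v_max` below are illustrative assumptions, not values from the paper.

```python
def expected_return(probs, v_min, v_max):
    """Collapse a categorical return distribution into a scalar Q-value.

    probs: probabilities over len(probs) evenly spaced atoms in [v_min, v_max].
    """
    n = len(probs)
    step = (v_max - v_min) / (n - 1)
    atoms = [v_min + i * step for i in range(n)]
    return sum(p * z for p, z in zip(probs, atoms))


uniform = [1.0 / 51] * 51               # maximally uncertain prediction
confident = [0.0] * 50 + [1.0]          # all mass on the highest atom
print(expected_return(uniform, -1.0, 1.0))
print(expected_return(confident, -1.0, 1.0))
```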

Polyak averaging: A technique to update target network parameters slowly (moving average) to stabilize training
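A minimal sketch of a Polyak (soft) target update on flat parameter lists is shown below; the rate `tau` is an illustrative assumption and the function name is hypothetical.

```python
def polyak_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction `tau` toward the online parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):
    target = polyak_update(target, online, tau=0.1)
print(target)  # slowly approaching the online parameters
```

A small `tau` keeps the bootstrapping target nearly fixed between updates, which is the stabilizing effect the technique is used for.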

Dueling Network: A neural network architecture that separates state value estimation from action advantage estimation
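The standard dueling aggregation combines the two streams by subtracting the mean advantage, as in this short sketch (the function name is hypothetical):

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage makes the decomposition identifiable:
    the resulting Q-values always average to `state_value`.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


print(dueling_q_values(2.0, [1.0, 0.0, -1.0]))  # [3.0, 2.0, 1.0]
```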

SiLU: Sigmoid Linear Unit—an activation function used in the neural networks
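For reference, SiLU is defined as x multiplied by sigmoid(x). The scalar sketch below uses two algebraically equivalent branches to avoid overflow for large negative inputs.

```python
import math


def silu(x):
    """SiLU(x) = x * sigmoid(x), written to stay finite for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


print(silu(0.0), silu(1.0), silu(-1.0))
```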

AdamW: A variant of the Adam optimizer that decouples weight decay from the gradient-based update

RLBench: A benchmark environment for robot learning tasks

DMC: DeepMind Control Suite—a standard benchmark for continuous control physics tasks