Equivariant Reinforcement Learning under Partial Observability

📝 Paper Summary

Reinforcement Learning under Partial Observability
Equivariant Neural Networks

The paper extends the theory of group-invariant MDPs to partially observable settings (POMDPs) and proposes equivariant recurrent actor-critic agents that leverage geometric symmetries to significantly improve sample efficiency in robotic tasks.

Core Problem

Standard Reinforcement Learning (RL) methods in partially observable domains (PORL) require massive amounts of data to learn policies, often failing to generalize across symmetric variations of the same task.

Why it matters:

Robotic learning is notoriously sample-inefficient, making real-world training prohibitive
Existing symmetry-preserving (equivariant) methods are limited to fully observable MDPs, failing in realistic scenarios where robots have limited sensors (e.g., top-down views hiding state properties)
Data augmentation is an inefficient alternative, requiring larger models and longer training times to learn symmetries that could instead be baked into the architecture

Concrete Example: In a 'Drawer-Opening' task, a robot sees a top-down view of two drawers but doesn't know which is unlocked. The optimal strategy (pulling a drawer to test it) is rotationally symmetric: if the chest rotates 90 degrees, the optimal action sequence should rotate 90 degrees as well. Standard agents must relearn this behavior for every orientation, whereas equivariant agents generalize across orientations by construction.
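The symmetry in this example can be checked on a toy problem. The sketch below is not the paper's agent: `toy_policy` is a hypothetical hand-written rule that points from the grid center toward the brightest cell, and the 3x3 grid stands in for the top-down image. It only illustrates the equivariance condition π(g·o) = g·π(o) for a 90-degree rotation g.

```python
def rot90_grid(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rot90_action(action):
    """Rotate a planar displacement (dx, dy) 90 degrees counter-clockwise."""
    dx, dy = action
    return (-dy, dx)


def toy_policy(grid):
    """Hypothetical policy: displacement from the center to the brightest cell.

    Assumes an odd-sized square grid with a unique maximum.
    Coordinates: x points right, y points up.
    """
    n = len(grid)
    center = (n - 1) // 2
    best_r, best_c = max(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: grid[rc[0]][rc[1]],
    )
    return (best_c - center, center - best_r)


observation = [
    [0, 0, 9],
    [0, 0, 0],
    [0, 0, 0],
]

action = toy_policy(observation)
rotated_action = toy_policy(rot90_grid(observation))

# Equivariance: acting on the rotated observation equals rotating the action.
assert rotated_action == rot90_action(action)
```

An equivariant network satisfies this condition for every input by construction of its weights, whereas a standard network can only approximate it from data.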

Key Novelty

Group-Invariant POMDPs & Equivariant Recurrent Agents

Formalizes 'Group-Invariant POMDPs', proving that if the environment dynamics and observation functions are symmetric, the optimal policy and value function are also symmetric (equivariant/invariant)
Embeds this symmetry directly into the agent's neural architecture using equivariant convolutions and recurrent layers, forcing the agent to treat rotated inputs as mathematically equivalent without seeing them during training
Extends popular RL algorithms (SAC, A2C) to be both recurrent (dealing with partial observability) and equivariant (dealing with symmetry)

Architecture

The architecture of the Equivariant Actor-Critic network.

Evaluation Highlights

Achieves ~95-100% success rate on the real-robot Drawer Opening task with only 1.5k training steps, while non-equivariant baselines largely fail (<20%)
Outperforms non-equivariant recurrent baselines by large margins on 4 simulated robotic manipulation tasks, often reaching optimal performance 2-5x faster
Demonstrates robustness to unseen rotations during testing, maintaining high performance where standard baselines drop significantly

Breakthrough Assessment

8/10

Provides a solid theoretical foundation extending equivariant RL to POMDPs and demonstrates strong empirical gains on real hardware. It bridges a crucial gap for practical robotic learning.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) defined as (S, A, Ω, b0, T, R, O)

Inputs: Sequence of observations o_t (images) and past actions a_{t-1}

Outputs: Action distribution π(a|h) or action value Q(h, a)

Pipeline Flow

Observation Encoder (Equivariant CNN)
Temporal Aggregation (Equivariant RNN/LSTM)
Policy/Value Heads (Equivariant MLPs)

System Modules

Observation Encoder

Extract spatial features from image observations while preserving rotational symmetry

Model or implementation: Equivariant CNN (using C_n regular representations)

Temporal Aggregator

Integrate history of observations to form a belief state representation

Model or implementation: Equivariant LSTM (EqLSTM)
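To illustrate how a recurrent layer can preserve rotational symmetry over a history, here is a minimal sketch. It is not the paper's EqLSTM: it is a plain tanh recurrence acting on C4 regular-representation features (one value per rotation), and all weights and inputs are made-up numbers. The assumption is that both weight matrices are circulant, which makes them commute with cyclic shifts, the way a rotation acts on such features. Past actions, which the real agent also receives, are left out for brevity.

```python
import math


def circulant(first_row):
    """Build a circulant matrix: row i is first_row cyclically shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def shift(vector, k):
    """Action of the k-th rotation on a regular-representation feature: cyclic shift by k."""
    n = len(vector)
    return [vector[(i - k) % n] for i in range(n)]


def recurrent_step(hidden, encoded_obs, w_hidden, w_input):
    """One tanh recurrence step; equivariant when both weight matrices are circulant."""
    pre = [a + b for a, b in zip(matvec(w_hidden, hidden), matvec(w_input, encoded_obs))]
    return [math.tanh(p) for p in pre]


def run(history, w_hidden, w_input):
    """Fold a sequence of encoded observations into a final hidden state."""
    hidden = [0.0] * len(history[0])  # the zero state is unchanged by any shift
    for encoded_obs in history:
        hidden = recurrent_step(hidden, encoded_obs, w_hidden, w_input)
    return hidden


# Hypothetical weights and encoded observations for C4 (4 rotations).
w_hidden = circulant([0.5, -0.25, 0.0, 0.125])
w_input = circulant([1.0, 0.5, -0.5, 0.25])
history = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, -1.0, 0.0]]

plain = run(history, w_hidden, w_input)
rotated = run([shift(x, 1) for x in history], w_hidden, w_input)

# Rotating every observation in the history rotates the final hidden state.
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(rotated, shift(plain, 1)))
```

The same argument applies gate by gate in an LSTM, since sigmoid, tanh, and elementwise products all commute with a permutation of the feature coordinates.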

Actor Head

Output action probabilities

Model or implementation: Equivariant MLP

Critic Head

Estimate value of current history/action

Model or implementation: Invariant MLP
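A value head is invariant when Q(g·h, g·a) = Q(h, a): rotating the history and the action together must not change the estimate. One common way to get this, shown here as a sketch and not necessarily the paper's construction, is to pool over the group axis of regular-representation features. The function `q_value` and the feature vectors are hypothetical.

```python
def shift(vector, k):
    """Action of the k-th rotation on a regular-representation feature: cyclic shift by k."""
    n = len(vector)
    return [vector[(i - k) % n] for i in range(n)]


def q_value(history_feat, action_feat):
    """Hypothetical invariant head: combine features, then max-pool over the group axis."""
    joint = [h + a for h, a in zip(history_feat, action_feat)]
    return max(joint)  # max over all rotations does not depend on their order


history_feat = [0.2, 1.5, -0.3, 0.7]  # made-up C4 feature of the history
action_feat = [1.0, 0.0, 0.0, 0.0]    # made-up C4 feature of the action

base = q_value(history_feat, action_feat)
both_rotated = q_value(shift(history_feat, 1), shift(action_feat, 1))
only_history_rotated = q_value(shift(history_feat, 1), action_feat)

# Rotating history and action together leaves the value unchanged.
assert both_rotated == base
# Rotating only the history is a different situation, so the value may change.
assert only_history_rotated != base
```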

Novel Architectural Elements

Integration of Equivariant CNNs with Equivariant LSTMs for POMDP policies
Specific architectural constraints ensuring end-to-end equivariance (policy) and invariance (value) for history-dependent agents

Modeling

Base Model: Custom Equivariant Actor-Critic (based on EqSAC and EqA2C)

Training Method: Reinforcement Learning (Recurrent SAC and Recurrent A2C)

Objective Functions:

Purpose: Maximize expected return with entropy regularization (SAC).

Formally: J(π) = E[Σ_t γ^t (R(s_t, a_t) + αH(π(·|h_t)))]

Purpose: Minimize Bellman error for Q-function.

Formally: L_Q(θ) = E[(Q_θ(h,a) - (r + γV(h')))^2]
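As a worked illustration of the two objectives above, the sketch below computes the soft value V(h') and the squared Bellman error for a small batch. It assumes a discrete action set so the expectation becomes a sum, and uses the standard SAC definition V(h') = E_a[Q(h', a) - α log π(a|h')], which this summary does not spell out. The numbers and function names are made up.

```python
import math


def soft_value(q_values, probs, alpha):
    """Soft value for a discrete action set: sum_a pi(a) * (Q(a) - alpha * log pi(a))."""
    return sum(p * (q - alpha * math.log(p)) for q, p in zip(q_values, probs) if p > 0.0)


def critic_loss(q_pred, rewards, next_values, gamma=0.99):
    """Mean of (Q(h, a) - (r + gamma * V(h')))^2 over a batch."""
    errors = [
        (q - (r + gamma * v)) ** 2
        for q, r, v in zip(q_pred, rewards, next_values)
    ]
    return sum(errors) / len(errors)


# Two actions with equal probability; alpha weights the entropy bonus.
next_v = soft_value(q_values=[1.0, 3.0], probs=[0.5, 0.5], alpha=0.2)

# The entropy of a uniform two-action policy is log(2), so V = 2.0 + 0.2 * log(2).
assert math.isclose(next_v, 2.0 + 0.2 * math.log(2))

loss = critic_loss(q_pred=[2.5], rewards=[1.0], next_values=[next_v])
```

In a recurrent agent, `q_pred` and `next_values` would come from networks evaluated on histories h and h' rather than on states.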

Key Hyperparameters:

discount_factor_gamma: 0.99
polyak_tau: 0.005
learning_rate: 1e-3 (Critic), 1e-4 (Actor)
replay_buffer_size: 100,000
batch_size: 32 or 64
symmetry_group: C4 or C8 (cyclic groups of rotations)
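For orientation, the listed values could be collected in a configuration like the one below, together with the soft target update that `polyak_tau` normally controls in SAC-style training. The dictionary keys and the `polyak_update` helper are illustrative names, not taken from the released code.

```python
# Values copied from the list above; key names are hypothetical.
config = {
    "discount_factor_gamma": 0.99,
    "polyak_tau": 0.005,
    "learning_rate_critic": 1e-3,
    "learning_rate_actor": 1e-4,
    "replay_buffer_size": 100_000,
    "batch_size": 64,        # the summary lists 32 or 64
    "symmetry_group": "C4",  # the summary lists C4 or C8
}


def polyak_update(target_params, online_params, tau):
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = polyak_update(target, online, config["polyak_tau"])
```

With tau = 0.005 the target network moves 0.5% of the way toward the online network per update, which keeps the Bellman targets slowly varying.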

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAD/DrQ: Uses hard-coded architectural constraints (weights) rather than data augmentation; theoretically guarantees equivariance rather than learning it approximately
vs. Standard Recurrent RL: Incorporates group-equivariant convolutions and LSTMs to exploit domain symmetries for faster learning
vs. Wang et al. (Equivariant Q-Learning) [cited]: Extends the framework from fully observable MDPs to POMDPs (histories instead of states)

Limitations

Assumes the symmetry group is known a priori (e.g., rotations)
The implementation is limited to discrete cyclic groups (C4, C8), which only approximate continuous rotations
Requires domain knowledge to specify the correct group representations for observations and actions
Does not address symmetries beyond spatial rotations/reflections (e.g., scale or permutation)

Reproducibility

Code: https://github.com/mhai0905/EqPORL

Hyperparameters and specific network architectures (layer counts, channel sizes) are detailed in Appendix D.

📊 Experiments & Results

Evaluation Setup

Robotic manipulation tasks with partial observability (e.g., limited field of view, hidden properties)

Benchmarks:

Drawer Opening (Information Gathering / Manipulation) [New]
Picking (Grasp selection in clutter)
Pushing (Object manipulation to target)
Moving (Object navigation)

Metrics:

Success Rate
Sample Efficiency (Number of steps to convergence)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative performance on simulated manipulation tasks shows equivariant models converging much faster and to higher success rates.

Benchmark	Metric	Baseline	This Paper	Δ
Drawer Opening (Sim)	Success Rate	0.20	0.95	+0.75
Picking (Sim)	Success Rate	0.55	0.95	+0.40
Drawer Opening (Real)	Success Rate	0.15	1.00	+0.85
Drawer Opening (Real)	Success Rate	0.50	1.00	+0.50

Experiment Figures

Learning curves (Success Rate vs. Training Steps) for 4 simulated tasks comparing EqSAC/EqA2C against baselines.

Conceptual illustration of the Drawer Opening POMDP and its rotational symmetry.

Main Takeaways

Equivariant agents (EqSAC/EqA2C) consistently outperform non-equivariant baselines (SAC/A2C) and data-augmentation baselines (DrQ) in sample efficiency.
In tasks requiring information gathering (Drawer Opening), equivariant models learn the necessary exploration strategies much faster.
Real-world experiments confirm that policies trained in simulation with equivariant constraints transfer robustly to physical hardware.
The performance gap is largest in tasks with high symmetry (Picking, Drawer Opening) where the equivariant inductive bias provides the most information.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs vs POMDPs)
Group Theory (Symmetry Groups, Equivariance, Invariance)
Neural Network Architectures (CNNs, RNNs)

Key Terms

POMDP: Partially Observable Markov Decision Process—a framework where the agent cannot see the full state of the world and must make decisions based on partial observations (e.g., camera images)

Equivariance: A property where transforming the input (e.g., rotating an image) results in a corresponding transformation of the output (e.g., the action vector rotates)

Invariance: A property where transforming the input results in the *same* output (e.g., the value of a state doesn't change if the scene rotates)

Group C_n: Cyclic group of order n, representing discrete rotational symmetries (e.g., rotations by 90 degrees for C4)
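A small sketch of C_n as integers modulo n may make the definition concrete. Element g stands for a rotation by g · 360/n degrees; the function names are illustrative.

```python
def compose(g, h, n):
    """Group operation in C_n: rotating by g steps and then by h steps."""
    return (g + h) % n


def inverse(g, n):
    """The rotation that undoes g."""
    return (-g) % n


def angle_degrees(g, n):
    """Rotation angle represented by element g of C_n."""
    return 360.0 * g / n


# In C4, a 90-degree turn followed by a 270-degree turn is the identity.
assert compose(1, 3, 4) == 0
assert angle_degrees(1, 4) == 90.0
# C8 refines this to 45-degree steps.
assert angle_degrees(1, 8) == 45.0
```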

SO(2): Special orthogonal group in two dimensions, the group of all continuous rotations of the plane

Group Representation: A mapping that describes how group elements (like rotations) act on a specific vector space (like an image or feature map)
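Two representations of C_n come up repeatedly in equivariant networks: the standard representation (2x2 rotation matrices, suited to planar action vectors) and the regular representation (n x n permutation matrices, suited to feature channels indexed by rotation). The sketch below builds both and checks the defining property ρ(g)ρ(h) = ρ(g·h). Function names are illustrative.

```python
import math


def standard_rep(g, n):
    """2x2 rotation matrix for element g of C_n."""
    theta = 2.0 * math.pi * g / n
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def regular_rep(g, n):
    """n x n permutation matrix that cyclically shifts coordinates by g."""
    return [[1 if (i - j) % n == g % n else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def close(a, b, tol=1e-12):
    """Entrywise comparison with a tolerance for floating-point rounding."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


n = 4
for g in range(n):
    for h in range(n):
        gh = (g + h) % n
        # Homomorphism property: multiplying matrices matches composing rotations.
        assert matmul(regular_rep(g, n), regular_rep(h, n)) == regular_rep(gh, n)
        assert close(matmul(standard_rep(g, n), standard_rep(h, n)), standard_rep(gh, n))
```

Specifying which representation acts on the observations, hidden features, and actions is the domain knowledge an equivariant architecture needs from its designer.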

Recurrent Neural Network (RNN): A network with internal memory that processes sequences of inputs, essential for POMDPs to remember past observations

Actor-Critic: An RL architecture with two components: an Actor (policy) that decides actions and a Critic (value function) that evaluates them