CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning

📝 Paper Summary

Offline Inverse Reinforcement Learning (Offline IRL), Model-Based Reinforcement Learning

CLARE mitigates reward extrapolation error in offline IRL by learning a conservative reward function that penalizes uncertain model rollouts while exploiting expert and diverse data.

Core Problem

Offline IRL suffers from 'reward extrapolation error,' where learned rewards incorrectly value unseen states due to covariate shift, misguiding agents when they stray from the expert's narrow distribution.

Why it matters:

Standard IRL requires costly online interactions, while offline methods struggle to generalize beyond static datasets.
Without reinforcement signals, learned rewards often assign high values to out-of-distribution (OOD) states, leading to catastrophic policy failure in safety-critical domains like robotics.

Concrete Example: In a MuJoCo task, an agent trained with standard MaxEnt IRL might learn a reward function that assigns high value to a physically impossible pose never seen in expert data. When the agent attempts this pose during deployment, it fails, because the reward function didn't know to penalize it.

Key Novelty

Conservative Model-Based Reward Learning

Introduces a pointwise weighting mechanism for reward updates that assigns positive weights to data with low model uncertainty and negative weights (penalties) to uncertain model rollouts.
Alternates between updating the reward function to explain data conservatively and optimizing a policy within the learned dynamics model (safe policy improvement).
Derives, from theoretical analysis, the optimal weights that minimize the return gap between the learned policy and the expert policy.
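
The sign pattern of the weighting mechanism can be sketched as follows. This is a minimal illustration, assuming a scalar uncertainty score normalized to [0, 1] and a conservatism threshold u; the function name `pointwise_weight` and the unit magnitudes are hypothetical placeholders, not the closed-form weights derived in the paper.

```python
def pointwise_weight(uncertainty: float, u: float) -> float:
    """Return an illustrative weight for one state-action pair.

    Assumptions (not from the paper): `uncertainty` is a model-uncertainty
    score in [0, 1], `u` is the conservatism threshold, and the +1/-1
    magnitudes stand in for the paper's derived values.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty is assumed to lie in [0, 1]")
    # Low uncertainty: trust the sample and push its reward up.
    if uncertainty <= u:
        return 1.0
    # High uncertainty: penalize the sample by pushing its reward down.
    return -1.0


# Three samples with increasing uncertainty, evaluated at u = 0.6.
weights = [pointwise_weight(c, u=0.6) for c in (0.1, 0.5, 0.9)]
```

Under this sketch, lowering u moves more samples to the penalized side, which is the sense in which a smaller u is more conservative.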

Architecture

Conceptual illustration of the two-tier tradeoffs in CLARE: Exploitation of Expert/Diverse data vs. Exploration of Model-based synthetic data.

Evaluation Highlights

Outperforms state-of-the-art IQ-LEARN by over 2000 average return on Half-Cheetah (Expert & Medium dataset mixture).
Achieves expert-level performance on Walker2d (5010.4 return) using only 10k expert tuples, edging out Behavior Cloning (4990.5) and far surpassing IQ-LEARN (1665.7).
Demonstrates robust convergence in fewer than 50k gradient steps across multiple MuJoCo tasks.

Breakthrough Assessment

8/10

Provides a principled theoretical framework and strong empirical results for a critical problem (reward extrapolation) in offline IRL, significantly outperforming recent baselines like IQ-LEARN.

⚙️ Technical Details

Problem Definition

Setting: Offline Inverse Reinforcement Learning (IRL) in a Markov Decision Process (MDP).

Inputs: Static expert dataset D_E and diverse (lower-quality) dataset D_B.

Outputs: A learned reward function r(s,a) and a policy π(a|s) that mimics/outperforms the expert.

Pipeline Flow

Step 1: Learn Dynamics Model (Ensemble of NNs) from offline data
Step 2: Calculate Uncertainty Weights β(s,a) for offline data
Step 3: Iterative Loop (Reward Update ↔ Policy Improvement)
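
The three steps above can be sketched as a single control loop. This is a structural outline only: the four callables (`fit_dynamics`, `compute_weights`, `update_reward`, `improve_policy`) are hypothetical placeholders supplied by the caller, and the default of five iterations is an assumption rather than a setting taken from the paper.

```python
from typing import Any, Callable, Tuple


def alternating_loop(
    dataset: Any,
    fit_dynamics: Callable[[Any], Any],
    compute_weights: Callable[[Any, Any], Any],
    update_reward: Callable[[Any, Any, Any, Any, Any], Any],
    improve_policy: Callable[[Any, Any, Any], Any],
    num_iterations: int = 5,
) -> Tuple[Any, Any]:
    """Run the pipeline once and return the final (reward, policy) pair."""
    # Step 1: learn the dynamics model from the offline data.
    model = fit_dynamics(dataset)
    # Step 2: compute the uncertainty weights for the offline data.
    weights = compute_weights(model, dataset)
    # Step 3: alternate reward updates and policy improvement.
    reward, policy = None, None
    for _ in range(num_iterations):
        reward = update_reward(reward, policy, model, dataset, weights)
        policy = improve_policy(policy, reward, model)
    return reward, policy
```

Passing the components in as arguments keeps the outline independent of any particular dynamics model or policy optimizer.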

System Modules

Dynamics Model Ensemble

Estimate transition dynamics and quantify uncertainty via ensemble disagreement

Model or implementation: Ensemble of 7 Probabilistic Neural Networks (best 5 selected)
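
One common way to turn ensemble disagreement into a scalar uncertainty score is the largest distance between any member's prediction and the ensemble mean. The sketch below assumes that definition; the paper's exact uncertainty measure may differ, and the name `ensemble_disagreement` is illustrative.

```python
import math
from typing import List, Sequence


def ensemble_disagreement(predictions: Sequence[Sequence[float]]) -> float:
    """Largest Euclidean distance from any member's prediction to the mean.

    `predictions[i]` is the next-state prediction of ensemble member i for
    the same (state, action) input. This is an assumed disagreement measure.
    """
    if not predictions:
        raise ValueError("need at least one ensemble member")
    n = len(predictions)
    dim = len(predictions[0])
    if any(len(p) != dim for p in predictions):
        raise ValueError("all predictions must have the same dimension")
    # Per-dimension mean prediction across the ensemble.
    mean: List[float] = [sum(p[d] for p in predictions) / n for d in range(dim)]
    # Disagreement is zero when all members agree exactly.
    return max(math.dist(list(p), mean) for p in predictions)


# Three members predicting a 2-D next state for the same input.
score = ensemble_disagreement([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
```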

Conservative Reward Updater

Update reward function to match expert feature expectations while penalizing uncertainty

Model or implementation: 4-layer MLP (256 units)

Safe Policy Improver

Optimize policy using the current conservative reward and learned dynamics

Model or implementation: SAC (Soft Actor-Critic) with 2-layer MLP (256 units)

Novel Architectural Elements

Pointwise weight parameter β(s,a) mechanism integrated into the MaxEnt IRL objective
Dual-buffer system: Replay buffer for model rollouts + Offline dataset buffers for reward weighting

Modeling

Base Model: Feedforward Neural Networks (MLPs) for Actor, Critic, Reward, and Dynamics

Training Method: Alternating Optimization (Reward Learning + Model-Based Policy Optimization)

Objective Functions:

Purpose: Update reward to explain expert data while penalizing high-uncertainty regions.

Formally: Minimize L(r|π) involving terms E[r(s,a)] on model rollouts (penalized), expert data (maximized), and diverse data (weighted).

Purpose: Improve policy safely within the learned model.

Formally: Maximize E[r(s,a)] + αH(π) - λ D_KL(π_b || π) using SAC.
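
The reward objective described above can be read as a difference of averages. The sketch below is one schematic reading, assuming rewards have already been evaluated on each batch; it omits any regularization term, and the function name `conservative_reward_loss` and its argument names are hypothetical. It is not the paper's exact loss.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("batch must be non-empty")
    return sum(values) / len(values)


def conservative_reward_loss(
    r_rollout: Sequence[float],
    r_expert: Sequence[float],
    r_diverse: Sequence[float],
    beta_diverse: Sequence[float],
) -> float:
    """Schematic loss: minimizing it lowers reward on model rollouts,
    raises reward on expert data, and moves reward on diverse data up or
    down according to the sign of each pointwise weight."""
    if len(r_diverse) != len(beta_diverse):
        raise ValueError("one weight per diverse sample is required")
    weighted_diverse = _mean([b * r for b, r in zip(beta_diverse, r_diverse)])
    return _mean(r_rollout) - _mean(r_expert) - weighted_diverse


# Toy batch: the two diverse samples carry opposite-sign weights and cancel.
loss = conservative_reward_loss(
    r_rollout=[1.0, 1.0],
    r_expert=[2.0, 2.0],
    r_diverse=[1.0, 1.0],
    beta_diverse=[1.0, -1.0],
)
```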

Key Hyperparameters:

reward_learning_rate: 5e-5
actor_critic_learning_rate: 3e-4
dynamics_learning_rate: 1e-3
policy_regularizer_weight_lambda: 0.25
discount_factor: 0.99
batch_size: 5000 (rollout)
conservatism_level_u: 0.4, 0.6, or 0.8 (tuned per task)
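
For reuse in code, the listed values can be gathered into a single configuration object. The sketch below is illustrative: the class name `ClareHyperparameters` and the field `rollout_batch_size` are assumptions, and `conservatism_level_u` is left without a default because the summary reports it as tuned per task.

```python
from dataclasses import dataclass

# Values of u reported in the summary above.
REPORTED_U_VALUES = (0.4, 0.6, 0.8)


@dataclass(frozen=True)
class ClareHyperparameters:
    conservatism_level_u: float  # tuned per task, so no default is given
    reward_learning_rate: float = 5e-5
    actor_critic_learning_rate: float = 3e-4
    dynamics_learning_rate: float = 1e-3
    policy_regularizer_weight_lambda: float = 0.25
    discount_factor: float = 0.99
    rollout_batch_size: int = 5000

    def __post_init__(self) -> None:
        # Restrict u to the reported grid; relax this check to explore others.
        if self.conservatism_level_u not in REPORTED_U_VALUES:
            raise ValueError(f"u must be one of {REPORTED_U_VALUES}")


config = ClareHyperparameters(conservatism_level_u=0.6)
```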

Compute: Not reported in the paper

Comparison to Prior Work

vs. IQ-LEARN: CLARE uses a dynamics model for generalization and explicit conservatism weights, whereas IQ-LEARN is model-free.
vs. MOMAX: CLARE uses pointwise weights β(s,a) derived from theoretical analysis to balance exploration/exploitation, whereas MOMAX naively combines IRL and model-based RL.
vs. BC: CLARE infers rewards and dynamics, enabling better generalization than simple supervised cloning.

Limitations

Relies on the ability to learn a reasonable dynamics model; may fail if dynamics are too complex to estimate from offline data.
Requires tuning the conservatism parameter 'u', which varies across tasks.
Theoretical analysis assumes finite state/action spaces, though implementation works in continuous domains.
Computationally more intensive than model-free methods due to dynamics training and rollouts.

Reproducibility

Code: https://github.com/polixir/OfflineRL

Code implementation is built upon the public framework https://github.com/polixir/OfflineRL. The paper provides detailed hyperparameters in Appendix A.2 and algorithms in pseudo-code. State-action pairs are sampled from D4RL.

📊 Experiments & Results

Evaluation Setup

Continuous control tasks using the D4RL offline RL benchmark suite.

Benchmarks:

MuJoCo (Half-Cheetah, Walker2d, Hopper, Ant) (Continuous Control / Locomotion)

Metrics:

Average Return (apparently raw, unnormalized return judging by magnitude, e.g., >3000 for Walker2d)
Statistical methodology: Results averaged over 7 random seeds.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Mixed Quality Data (Expert + Medium): CLARE outperforms the baseline on Walker2d and Half-Cheetah but trails it on Hopper when data is a mix of expert and sub-optimal demonstrations.
Walker2d (Exp & Med)	Average Return	1674.2	3613.4	+1939.2
Hopper (Exp & Med)	Average Return	2135.0	1422.7	-712.3
Half-Cheetah (Exp & Med)	Average Return	2375.0	4667.8	+2292.8
Performance on Expert Data Only: CLARE matches or exceeds baselines even with limited high-quality data.
Walker2d (Expert)	Average Return	4990.5	5010.4	+19.9
Ant (Expert)	Average Return	3940.3	5172.8	+1232.5
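
The Δ column is This Paper minus Baseline. A short check using only the numbers in the table above recomputes each difference and picks out the rows where CLARE trails the baseline; the variable names are illustrative.

```python
# (baseline, this paper) average returns copied from the table above.
rows = {
    "Walker2d (Exp & Med)": (1674.2, 3613.4),
    "Hopper (Exp & Med)": (2135.0, 1422.7),
    "Half-Cheetah (Exp & Med)": (2375.0, 4667.8),
    "Walker2d (Expert)": (4990.5, 5010.4),
    "Ant (Expert)": (3940.3, 5172.8),
}

# Round to one decimal place to match the precision reported in the table.
deltas = {name: round(ours - base, 1) for name, (base, ours) in rows.items()}

# Rows where the reported return is below the baseline.
regressions = [name for name, delta in deltas.items() if delta < 0]
```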

Experiment Figures

Performance curves (Average Return vs Number of Tuples) for CLARE against baselines on 4 environments.

Analysis of conservatism parameter u, convergence speed, and recovered reward quality.

Main Takeaways

CLARE yields the best performance by a significant margin on almost all datasets, particularly those with low-quality data (mixed expert/random or expert/medium).
The learned reward function effectively guides offline policy search while exploiting useful knowledge in diverse (sub-optimal) data.
Ablation studies on parameter 'u' (conservatism) show that increased conservatism (lower u) generally improves performance, validating the importance of penalizing uncertain regions.
The method converges efficiently, often within 5 iterations (fewer than 50k gradient steps).

📚 Prerequisite Knowledge

Prerequisites

Inverse Reinforcement Learning (IRL)
Model-Based Reinforcement Learning
MaxEnt (Maximum Entropy) framework
Distributional shift / Covariate shift

Key Terms

Reward Extrapolation Error: The failure of a learned reward function to correctly evaluate states outside the training distribution, often leading to hallucinated high rewards for bad actions.

Covariate Shift: The difference between the distribution of states visited by the expert and the states visited by the learning agent.

MaxEnt IRL: Maximum Entropy IRL, a framework that models the expert with the maximum-entropy trajectory distribution consistent with its observed behavior, so that trajectories become exponentially more likely as their cumulative reward increases.

Model-Based RL: RL approaches that learn a transition dynamics model (T) from data and use it to simulate experiences (rollouts) for planning or training.

SAC: Soft Actor-Critic, an off-policy RL algorithm that maximizes a trade-off between expected return and entropy.

Occupancy Measure: The distribution of state-action pairs visited by a policy.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning, a standard benchmark suite for offline RL.
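
Two of the terms above have standard formal statements, given here in their common textbook forms. The notation (τ for a trajectory, γ for the discount factor, ρ^π for the occupancy measure) is assumed and may differ from the paper's; some texts omit the (1 - γ) normalization factor.

```latex
% MaxEnt IRL (deterministic-dynamics form): a trajectory's likelihood
% grows exponentially with its cumulative reward.
p(\tau \mid r) \;\propto\; \exp\!\Big( \sum_{t} r(s_t, a_t) \Big)

% Normalized discounted occupancy measure of a policy \pi.
\rho^{\pi}(s, a) \;=\; (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}
    \Pr\big( s_t = s,\; a_t = a \mid \pi \big)
```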