Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

📝 Paper Summary

Offline Reinforcement Learning, Robot Learning

PA-RL decouples reinforcement learning into action optimization via a critic and supervised policy training, enabling a single algorithm to effectively fine-tune diverse architectures like diffusion models and transformers.

Core Problem

Standard deep RL algorithms are often hard-coded for specific policy classes (e.g., Gaussian), making them unstable or mathematically incompatible with modern, expressive architectures like diffusion models or autoregressive transformers.

Why it matters:

Expressive policies like diffusion models are necessary for multimodal tasks but are notoriously difficult to fine-tune with standard RL due to gradient instability
Practitioners currently must use weaker algorithms (like simple re-ranking) or heavily modify loss functions to accommodate new policy architectures
Robotic foundation models (like OpenVLA) cannot easily be fine-tuned with autonomous trial-and-error data using off-the-shelf methods

Concrete Example: SAC (Soft Actor-Critic) updates its policy with the reparameterization trick, which is stable for Gaussian policies. Applied to a diffusion policy, the same gradient has to be propagated through the denoising chain, where it becomes unstable or intractable and causes training to fail.

Key Novelty

Policy-Agnostic RL (PA-RL)

Treats the policy update as a supervised learning problem by training on 'optimized' actions rather than raw policy gradients
Generates training targets by sampling actions from the current policy, re-ranking them using a learned critic (global optimization), and refining them via gradient ascent (local optimization)
Decouples the choice of policy architecture from the RL optimization logic, allowing the same method to train diffusion, transformer, and Gaussian policies

Architecture

The PA-RL training loop illustrating the separation of action optimization from policy training.

Evaluation Highlights

Successfully fine-tunes OpenVLA (7B parameter robot policy) on a real robot, improving success rates from 40% to 70% in 40 minutes
Improves diffusion policies on real-world WidowX manipulation tasks by 80-100% within 1-2 hours of online fine-tuning
Outperforms the next-best offline RL baselines by 13% in aggregate across various simulated domains (CALVIN, LIBERO)

Breakthrough Assessment

9/10

Significantly simplifies the landscape of RL fine-tuning by providing a universal method for modern architectures. The demonstration of autonomously fine-tuning a 7B foundation model (OpenVLA) on physical hardware is a major practical milestone.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) covering both Fully Offline RL and Offline-to-Online Fine-tuning

Inputs: State s, Offline Dataset D_off (initial), Online interactions (during fine-tuning)

Outputs: Optimal Policy π(a|s) maximizing discounted cumulative reward
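To make "discounted cumulative reward" concrete, here is a minimal sketch of the quantity for a single trajectory. The function name and the example reward list are illustrative assumptions; the default discount of 0.99 matches the discount_factor_gamma value listed under Key Hyperparameters.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one trajectory.

    Accumulating from the last reward backwards avoids computing
    explicit powers of gamma.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sparse-reward episode: success is rewarded only at the last step.
example_rewards = [0.0, 0.0, 1.0]
print(discounted_return(example_rewards))
```

With a sparse success reward like this, a policy that reaches the goal in fewer steps earns a larger discounted return, which is the quantity the policy is trained to maximize.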

Pipeline Flow

Action Proposal (Base Policy)
Global Optimization (Critic Re-ranking)
Local Optimization (Gradient Ascent)
Policy Update (Supervised Learning)
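The first three stages of this pipeline can be sketched on a toy problem as follows. This is not the authors' implementation: the quadratic critic toy_q, its analytic gradient, the Gaussian sampler standing in for the base policy, and the step size of 0.1 are all assumptions made for illustration. Only the candidate count (16) and the number of local steps (10) come from the hyperparameters listed in this summary.

```python
import random


def toy_q(state, action):
    """Hypothetical critic: highest when the action equals the state vector."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def toy_q_grad(state, action):
    """Analytic gradient of toy_q with respect to the action."""
    return [-2.0 * (a - s) for a, s in zip(action, state)]


def sample_candidates(state, n, rng):
    """Stand-in for the base policy: n noisy action guesses around zero."""
    return [[rng.gauss(0.0, 1.0) for _ in state] for _ in range(n)]


def optimize_action(state, n_candidates=16, steps=10, step_size=0.1, seed=0):
    rng = random.Random(seed)
    # 1. Action proposal: draw candidates from the (stand-in) policy.
    candidates = sample_candidates(state, n_candidates, rng)
    # 2. Global optimization: keep the candidate the critic scores highest.
    best = max(candidates, key=lambda a: toy_q(state, a))
    # 3. Local optimization: gradient ascent on Q with respect to the action.
    action = list(best)
    for _ in range(steps):
        grad = toy_q_grad(state, action)
        action = [a + step_size * g for a, g in zip(action, grad)]
    return best, action


state = [0.5, -0.3]
best, target = optimize_action(state)
# 4. Policy update: `target` would serve as the label for a supervised loss.
print(toy_q(state, best), toy_q(state, target))
```

In a real system the critic would be a learned network and its action gradient would come from automatic differentiation; the control flow stays the same whichever policy class produced the candidates.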

System Modules

Base Policy

Generate initial action candidates given the state

Model or implementation: Variable (Gaussian, Diffusion, or Transformer/Autoregressive)

Critic (Q-function)

Estimate the expected return (Q-value) for state-action pairs

Model or implementation: MLP (typically trained via IQL or Cal-QL)

Action Optimizer

Refine actions to maximize Q-values

Model or implementation: Optimization Algorithm (Non-parametric)

Policy Updater

Update policy parameters to produce the optimized actions

Model or implementation: Supervised Learning Optimizer

Novel Architectural Elements

Insertion of an explicit 'Action Optimization' phase between the Critic and the Policy Update, combining global sampling and local gradient ascent
Universal use of Supervised Learning (NLL/DDPM loss) for the policy update step in an Actor-Critic loop, replacing architecture-specific policy gradients
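As a rough illustration of the supervised policy update, the sketch below fits the mean of a fixed-variance Gaussian policy to one optimized action by gradient descent on the NLL. Everything here is an assumption made for illustration: a real policy would be a neural network conditioned on the state, and a diffusion policy would use a denoising loss instead. The names gaussian_nll, nll_grad_mean, and supervised_step are hypothetical.

```python
import math


def gaussian_nll(mean, std, target):
    """NLL of a target action under a diagonal Gaussian with shared std."""
    return sum(
        0.5 * math.log(2.0 * math.pi * std ** 2) + (t - m) ** 2 / (2.0 * std ** 2)
        for m, t in zip(mean, target)
    )


def nll_grad_mean(mean, std, target):
    """Gradient of gaussian_nll with respect to the mean."""
    return [(m - t) / std ** 2 for m, t in zip(mean, target)]


def supervised_step(mean, std, target, lr=0.1):
    """One gradient-descent step that moves the mean toward the target."""
    grad = nll_grad_mean(mean, std, target)
    return [m - lr * g for m, g in zip(mean, grad)]


mean = [0.0, 0.0]             # current policy output for some state (toy)
optimized = [0.5, -0.3]       # optimized action a* used as the label
before = gaussian_nll(mean, 1.0, optimized)
for _ in range(50):
    mean = supervised_step(mean, 1.0, optimized)
after = gaussian_nll(mean, 1.0, optimized)
print(before, after, mean)
```

No gradient of the critic flows through the policy here; the critic influences the policy only through the choice of the label.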

Modeling

Base Model: Varies by experiment: OpenVLA (7B), Diffusion (DDPM-based), Transformer (Autoregressive)

Training Method: Actor-Critic RL (Policy Agnostic)

Objective Functions:

Purpose: Train the Critic to estimate values conservatively or implicitly.

Formally: Cal-QL or IQL value loss objectives (minimizing TD error with specific regularizers).

Purpose: Refine actions to create better training targets.

Formally: a* = argmax over the N sampled candidates a_i of Q(s, a_i), followed by repeated steps a* ← a* + η ∇_a Q(s, a*).

Purpose: Update the policy to imitate the optimized actions.

Formally: Minimize L_supervised(π(·|s), a*), e.g., NLL or Diffusion Denoising Loss.
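Written out with explicit indices, the action-optimization and policy-update objectives might look like the following. The iteration superscripts, the candidate count N, the step count T, and the policy parameters θ are notational assumptions added here for clarity; the summary itself only gives the compact form.

```latex
% Sketch only: N, T, \theta and the superscripts (k) are assumed notation.
\begin{aligned}
a^{(0)} &= \arg\max_{a_i \in \{a_1, \dots, a_N\}} Q(s, a_i),
  \qquad a_i \sim \pi_\theta(\cdot \mid s), \\
a^{(k+1)} &= a^{(k)} + \eta \, \nabla_a Q(s, a)\big|_{a = a^{(k)}},
  \qquad k = 0, \dots, T - 1, \\
a^{*} &= a^{(T)}, \\
\mathcal{L}(\theta) &= -\,\mathbb{E}_{s}\!\left[ \log \pi_\theta(a^{*} \mid s) \right]
  \quad \text{(NLL case)}.
\end{aligned}
```

The first line is the global (re-ranking) step, the second is the local (gradient ascent) step, and the last line treats a* as a fixed label, so no gradient of Q is taken with respect to θ.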

Adaptation: Fine-tuning of full weights or specific components (e.g., Low-Rank Adapters for large models like OpenVLA)

Key Hyperparameters:

action_candidates_N: 16 (Simulated), 8 (Real World)
local_optimization_steps: 10-20
learning_rate: 3e-4 (Policy), 3e-4 (Critic)
batch_size: 256
discount_factor_gamma: 0.99
diffusion_steps: 5 or 15 (depending on variant)
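One way to keep these values together is a small configuration object. The sketch below is an assumption about how such a config could be organized, not the layout used by the released code; the class name PARLConfig and the field names are hypothetical. Where the summary gives a range or alternatives (10-20 local steps, 5 or 15 diffusion steps), the default picks one listed value and the comment records the rest.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PARLConfig:
    """Hypothetical container for the hyperparameters listed above."""
    action_candidates_n: int = 16        # 16 in simulation, 8 on the real robot
    local_optimization_steps: int = 10   # summary reports a range of 10-20
    policy_learning_rate: float = 3e-4
    critic_learning_rate: float = 3e-4
    batch_size: int = 256
    discount_factor_gamma: float = 0.99
    diffusion_steps: int = 5             # 5 or 15, depending on variant


SIM_CONFIG = PARLConfig()
# The real-robot setting differs only in the number of sampled candidates.
REAL_ROBOT_CONFIG = replace(SIM_CONFIG, action_candidates_n=8)

print(SIM_CONFIG)
print(REAL_ROBOT_CONFIG)
```

Freezing the dataclass means a variant has to be derived with replace rather than edited in place, which keeps the simulated and real-world settings from drifting apart silently.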

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. SAC: PA-RL avoids reparameterization gradients through the policy, preventing instability with complex backbones
vs. IDQL: PA-RL adds local optimization (gradient ascent on actions) and iteratively trains the policy on these optimized actions, rather than just re-ranking at inference time
vs. DQL: PA-RL uses a supervised loss for the policy update, which is more stable than backpropagating Q-gradients through a diffusion chain

Limitations

Computational cost of action optimization (sampling + gradients) during training can be higher than standard methods
Requires a differentiable Q-function with respect to actions for the local optimization step
Performance depends on the quality of the critic; if the Q-function is inaccurate on out-of-distribution (OOD) actions, action optimization can exploit those errors and produce adversarial actions

Reproducibility

Code: https://PolicyAgnosticRL.github.io/

Code is publicly available at the project website. The paper details hyperparameters for both simulation and real-world experiments. OpenVLA weights are public, but the specific robot hardware (WidowX) is required for physical replication.

📊 Experiments & Results

Evaluation Setup

Offline RL and Offline-to-Online Fine-tuning in robotic manipulation

Benchmarks:

CALVIN (Simulated long-horizon robotic manipulation)
LIBERO (Simulated robotic manipulation suite)
WidowX Real Robot (Real-world pick-place and manipulation)

Metrics:

Success Rate
Average Return
Statistical methodology: Means and standard deviations reported over multiple seeds (3-5 seeds typical for sim)
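For readers unfamiliar with per-seed reporting, the following sketch shows how a mean and standard deviation over seeds are typically computed. The scores are made-up placeholders, not results from the paper, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the summary does not say which estimator was used.

```python
import statistics


def summarize(per_seed_scores):
    """Return (mean, sample standard deviation) across seeds."""
    mean = statistics.mean(per_seed_scores)
    std = statistics.stdev(per_seed_scores)  # sample std, n - 1 denominator
    return mean, std


# Placeholder success rates for three hypothetical seeds (NOT paper results).
placeholder_scores = [0.60, 0.70, 0.80]
mean, std = summarize(placeholder_scores)
print(f"{mean:.2f} +/- {std:.2f}")
```

With only 3-5 seeds the standard deviation is itself a noisy estimate, so small gaps between methods should be read with caution.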

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Real-world experiments demonstrating the fine-tuning of foundation models (OpenVLA) and diffusion policies on physical hardware.
WidowX (Bridge V2 Dataset Task: Put eggplant in basket)	Success Rate	40	70	+30
WidowX (Bridge V2 Dataset Task: Put carrot in plate)	Success Rate	25	45	+20
Simulated experiments comparing PA-RL against SOTA offline RL methods across different policy backbones.
CALVIN (Ant)	Average Return	1.8	3.2	+1.4
Various (CALVIN, LIBERO)	Normalized Score	Not reported in the paper	Not reported in the paper	Not reported in the paper
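The Δ column for the two WidowX rows is an absolute difference in success rate. The short check below recomputes it from the Baseline and This Paper columns and also shows the corresponding relative gain, which is a different quantity. The helper names are hypothetical, and the summary does not state which convention its other percentage claims follow.

```python
def absolute_gain(baseline, ours):
    """Difference in success rate, in percentage points."""
    return ours - baseline


def relative_gain_percent(baseline, ours):
    """Improvement expressed as a percentage of the baseline value."""
    return 100.0 * (ours - baseline) / baseline


# (baseline, this paper) success rates copied from the table above.
widowx_rows = {
    "Put eggplant in basket": (40, 70),
    "Put carrot in plate": (25, 45),
}

for task, (baseline, ours) in widowx_rows.items():
    print(task, absolute_gain(baseline, ours), relative_gain_percent(baseline, ours))
```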

Experiment Figures

Success rates of PA-RL versus baselines (IDQL, IQL, Cal-QL) on the CALVIN benchmark with Diffusion Policies.

Real-world improvement of OpenVLA and Diffusion policies on the WidowX robot.

Main Takeaways

PA-RL enables stable fine-tuning of diffusion and transformer policies where standard methods like SAC fail or require extensive tuning.
The method is particularly effective for long-horizon tasks (like CALVIN) where expressive multimodal policies are crucial.
Local optimization (gradient ascent on actions) provides a critical performance boost over simple re-ranking (global optimization) methods like IDQL.
Scales to very large models, demonstrated by the first successful autonomous RL fine-tuning of a 7B parameter OpenVLA model on a real robot.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Actor-Critic, Q-learning)
Diffusion Models (DDPM)
Transformers (Autoregressive sequence modeling)

Key Terms

PA-RL: Policy-Agnostic Reinforcement Learning—the proposed method that decouples action optimization from policy training

OpenVLA: A 7-billion parameter generalist robot policy based on a Vision-Language-Action architecture

Diffusion Policy: A policy that generates actions by iteratively denoising random noise, conditioned on the state

Autoregressive Policy: A policy that generates actions token-by-token (or dimension-by-dimension) sequentially, often using Transformers

SAC: Soft Actor-Critic—a popular off-policy RL algorithm that maximizes a trade-off between expected return and entropy

Cal-QL: Calibrated Q-Learning—an offline-to-online RL algorithm that learns a conservative value function to prevent overestimation

IQL: Implicit Q-Learning—an offline RL method that avoids querying out-of-sample actions during value training

NLL: Negative Log-Likelihood—a standard supervised learning loss function minimized to make the policy outputs match the target actions

WidowX: A specific type of robotic arm used for real-world manipulation experiments in the paper

CALVIN: A simulation benchmark for long-horizon robotic manipulation tasks