Actor-Critic Pretraining for Proximal Policy Optimization

📝 Paper Summary

Reinforcement Learning (RL), Imitation Learning, Robotic Manipulation & Locomotion

Actor-Critic Pretraining (ACP) improves RL sample efficiency by pretraining the critic as well as the actor: the actor is initialized by behavioral cloning, and the critic is then fit to the returns of rollouts generated by that cloned actor.

Core Problem

Deep Reinforcement Learning (RL) is highly sample-inefficient and prone to unsafe exploration, and standard pretraining methods (Behavioral Cloning) typically ignore the critic network.

Why it matters:

RL requires millions of interactions, causing physical wear and time costs in real-world robotics
Standard actor-only pretraining often leads to 'catastrophic forgetting': performance drops early in fine-tuning because the randomly initialized critic gives the pretrained actor poor guidance
Existing solutions like PIRL (Pretraining with Imitation and RL fine-tuning) can be unstable or slow to improve beyond the expert baseline

Concrete Example: In the Walker2D task, standard PPO fails to reach the target return within the training budget. Actor-only pretraining suffers an initial performance crash (catastrophic forgetting) before recovering. ACP avoids this crash and converges significantly faster.

Key Novelty

Actor-Critic Pretraining (ACP) with Residual Architecture

Pretrains the actor via Behavioral Cloning on expert data, then freezes it to generate rollouts
Pretrains the critic using the returns from these specific rollouts (ensuring value estimates match the pretrained policy's behavior)
Uses a residual actor architecture where the backbone is frozen during fine-tuning but a residual connection allows the decision head to adapt, preventing the loss of expert 'instincts'

Architecture

Conceptual flow of the pretraining and fine-tuning approach, showing the separation of Actor and Critic initialization.

Evaluation Highlights

86.1% average reduction in environment steps compared to PPO with no pretraining across 15 tasks
30.9% average reduction in environment steps compared to standard Actor-Only Pretraining (AP)
Mitigates catastrophic forgetting in complex environments like Ant and Walker2D where Actor-Only Pretraining initially degrades performance

Breakthrough Assessment

7/10

Solid empirical improvement (30%+) over strong baselines (Actor-Only Pretraining) in standard benchmarks. Addresses a clear logical gap (critic initialization) with a straightforward method. Limited to simulated robotics so far.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) with continuous action spaces

Inputs: State vector s_t

Outputs: Action vector a_t (continuous)

Pipeline Flow

Expert Data Collection
Actor Pretraining (BC)
Rollout Generation (using pretrained Actor)
Critic Pretraining (Supervised on Rollout Returns)
PPO Fine-tuning
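
The first four stages of the flow above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the 1-D environment, the linear actor, the quadratic critic features, the horizon, and the discount factor are all assumptions chosen so the example runs with the standard library alone.

```python
import random

GAMMA = 0.99   # assumed discount factor (the summary does not list one)
HORIZON = 20   # assumed episode length for the toy task

def step(state, action):
    """Hypothetical 1-D dynamics: move the point, reward closeness to 0."""
    next_state = state + action
    return next_state, -next_state ** 2

def expert_policy(state):
    """Sub-optimal expert: corrects only half of the error per step."""
    return -0.5 * state

def collect(policy, episodes, rng):
    """Run a policy; return a list of episodes of (state, action, reward)."""
    data = []
    for _ in range(episodes):
        state, episode = rng.uniform(-1.0, 1.0), []
        for _ in range(HORIZON):
            action = policy(state)
            next_state, reward = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        data.append(episode)
    return data

def returns_to_go(rewards, gamma=GAMMA):
    """Discounted return G_t for every step of one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx

rng = random.Random(0)

# 1. Expert data collection
d_exp = collect(expert_policy, episodes=20, rng=rng)

# 2. Actor pretraining (BC): linear actor a = w * s, least-squares (MSE) fit
pairs = [(s, a) for ep in d_exp for s, a, _ in ep]
w = sum(s * a for s, a in pairs) / sum(s * s for s, _ in pairs)

def actor(state):
    return w * state

# 3. Rollout generation with the pretrained actor held fixed
d_rol = collect(actor, episodes=20, rng=rng)

# 4. Critic pretraining: regress V(s) = c1 * s^2 + c0 on rollout returns
feats, targets = [], []
for ep in d_rol:
    feats += [s ** 2 for s, _, _ in ep]
    targets += returns_to_go([r for _, _, r in ep])
c1, c0 = fit_line(feats, targets)

def critic(state):
    return c1 * state ** 2 + c0

# 5. PPO fine-tuning would start here from (actor, critic).
```

The point of the sketch is the ordering: the critic's regression targets come from the cloned actor's own rollouts, so the value estimates describe the policy that PPO actually starts from.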

System Modules

Actor

Selects actions based on state

Model or implementation: Feedforward Network with Residual Head

Critic

Estimates state-value V(s)

Model or implementation: Feedforward Network

Novel Architectural Elements

Residual Actor Architecture: The decision head receives the backbone features together with the raw state, which reaches it through a residual connection. During fine-tuning the backbone is frozen and only the decision head is updated.
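
A rough sketch of this layout follows. The summary does not say how the state is merged with the backbone features, so concatenation is assumed here; the class name, layer sizes, and single-layer backbone are likewise hypothetical.

```python
import random

def linear(layer, x):
    """Dense layer: layer['W'] holds one weight row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(layer["W"], layer["b"])]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_out, n_in, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return {"W": weights, "b": [0.0] * n_out}

class ResidualActor:
    """Hypothetical actor: backbone features plus the raw state feed the head."""

    def __init__(self, state_dim, hidden_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.backbone = init_layer(hidden_dim, state_dim, rng)
        # Assumption: the head input is [backbone features, raw state].
        self.head = init_layer(action_dim, hidden_dim + state_dim, rng)
        self.backbone_frozen = False

    def forward(self, state):
        features = relu(linear(self.backbone, state))
        # Residual path: the state skips the backbone and reaches the head.
        return linear(self.head, features + list(state))

    def freeze_backbone(self):
        """Call before fine-tuning so only the decision head is updated."""
        self.backbone_frozen = True

    def trainable_layers(self):
        """Layers an optimizer should update in the current phase."""
        if self.backbone_frozen:
            return [self.head]
        return [self.backbone, self.head]

policy = ResidualActor(state_dim=3, hidden_dim=4, action_dim=2)
action = policy.forward([0.1, -0.2, 0.3])
policy.freeze_backbone()  # fine-tuning phase: head-only updates
```

An optimizer that only iterates over `trainable_layers()` leaves the pretrained backbone untouched during fine-tuning, while the head can still react to the state directly.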

Modeling

Base Model: Feedforward MLPs (ReLU activations)

Training Method: PPO with specific pretraining phases

Objective Functions:

Purpose: Pretrain Actor to mimic expert.

Formally: Minimize MSE ||a_t - π_μ(s_t)||² over expert data D_exp
Purpose: Pretrain Critic to match pretrained policy's returns.

Formally: Minimize MSE (v_φ(s_t) - G_rollout)² over rollout data D_rol
Purpose: Fine-tune both networks using PPO.

Formally: Maximize L_PPO = L_CLIP - c1*L_VF + c2*S[π]
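
As an illustration of how the three terms of L_PPO combine on one batch, here is a minimal sketch. The function name is hypothetical, and the defaults for the clip range, c1, and c2 are illustrative values rather than the paper's settings; the entropy is passed in as a number to avoid assuming a particular policy distribution.

```python
def ppo_objective(ratios, advantages, values, returns, entropy,
                  clip_eps=0.2, c1=0.5, c2=0.0):
    """Return L_CLIP - c1 * L_VF + c2 * S for one batch (to be maximized).

    ratios:     pi_new(a|s) / pi_old(a|s) per sample
    advantages: advantage estimates per sample
    values:     critic predictions v(s) per sample
    returns:    regression targets for the critic per sample
    entropy:    mean policy entropy S[pi] over the batch
    """
    n = len(ratios)
    l_clip = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        l_clip += min(ratio * adv, clipped * adv)
    l_clip /= n
    # Value-function loss: mean squared error against the return targets.
    l_vf = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    return l_clip - c1 * l_vf + c2 * entropy

objective = ppo_objective(
    ratios=[1.5, 0.5],
    advantages=[1.0, -1.0],
    values=[1.0, 2.0],
    returns=[1.0, 0.0],
    entropy=0.0,
)
```

In this batch both ratios fall outside the clip range, so the clipped surrogate is used for both samples. The L_VF term is also where a poorly initialized critic hurts, since it makes the value loss large and the advantage estimates unreliable.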

Training Data:

Expert data D_exp: Generated by a sub-optimal expert (65% of target return)
Rollout data D_rol: Generated by running the BC-pretrained actor in the environment

Key Hyperparameters:

sigma (actor variance): e^-2 (fixed during pretraining)
expert_performance_ratio: 0.65 (expert achieves 65% of target return)
ppo_fine_tuning_budget: 10^6 environment steps
discount_factor_gamma: Not explicitly listed (standard values implied)
gae_lambda: Not explicitly listed

Compute: Not reported in the paper

Comparison to Prior Work

vs. AP: ACP initializes the critic using supervised learning on rollouts, preventing the initial value estimation mismatch that causes forgetting
vs. PIRL: ACP pretrains the critic by supervised regression on rollout returns before RL fine-tuning begins, rather than through an initial RL phase with a frozen actor
vs. DAPG [not cited in paper]: DAPG (Demonstration Augmented Policy Gradient) combines BC loss with RL loss during fine-tuning, whereas ACP focuses on initialization and architectural freezing

Limitations

Requires expert demonstrations which may not always be available
Requires environment interaction (rollouts) during the pretraining phase, adding a cost before RL begins
Did not improve sample efficiency in 3 out of 15 environments (mostly high-dimensional Humanoid tasks)
The number of rollout steps used for critic pretraining is a hyperparameter whose effect on total environment steps is non-linear and environment-specific, so it is hard to tune beforehand

Reproducibility

No replication artifacts mentioned in the paper. Code URL is not provided. Uses standard Gym/Gymnasium-Robotics environments and RL Baselines3 Zoo configurations.

📊 Experiments & Results

Evaluation Setup

15 simulated robotic manipulation and locomotion tasks from Gymnasium and Gymnasium-Robotics

Benchmarks:

MuJoCo Locomotion (Ant, Hopper, Walker2D, Humanoid, etc.)
Fetch Robotics Manipulation (Reach, Push, Slide, PickAndPlace)

Metrics:

Total Environment Steps (to reach target return)
Sample Reduction (%)
Statistical methodology: Returns averaged over 3 random seeds; standard deviation shown in plots

Key Results

ACP consistently reduces the total environment steps required to reach a target return compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Walker2D	Total Steps (x10^3)	190.6	56.1	-134.5
Ant	Total Steps (x10^3)	191.0	98.7	-92.3
Hopper	Total Steps (x10^3)	87.0	30.9	-56.1
FetchPickAndPlace	Total Steps (x10^3)	210.0	106.0	-104.0
Reacher	Total Steps (x10^3)	196.8	58.7	-138.1

Ablation results demonstrate the specific contribution of architectural features.

Benchmark	Metric	Baseline	This Paper	Δ
All Environments (Average)	Sample Efficiency Gain	0.0	22.1	+22.1
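
The Δ column reports absolute differences in thousands of steps. To compare tasks on a relative scale, the per-task step counts can be converted to percentage reductions. The snippet below is only a convenience calculation on the five rows listed above; these per-row figures are not the averages over all 15 tasks reported under Evaluation Highlights.

```python
# (baseline steps, ACP steps) in thousands, copied from the table rows above
rows = {
    "Walker2D": (190.6, 56.1),
    "Ant": (191.0, 98.7),
    "Hopper": (87.0, 30.9),
    "FetchPickAndPlace": (210.0, 106.0),
    "Reacher": (196.8, 58.7),
}

def percent_reduction(baseline, ours):
    """Relative reduction in environment steps, in percent."""
    return 100.0 * (baseline - ours) / baseline

reductions = {task: percent_reduction(b, a) for task, (b, a) in rows.items()}

for task, value in reductions.items():
    print(f"{task}: {value:.1f}% fewer steps than the baseline")
```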

Experiment Figures

Learning curves (Episodic Return vs Environment Steps) for Ant, Walker2D, and FetchReach.

Total environment steps required (n_tot) vs number of rollout steps (n_rol) used for critic pretraining.

Main Takeaways

Critic pretraining is crucial: Initializing the critic to match the pretrained actor's value function prevents the 'catastrophic forgetting' seen in Actor-Only pretraining.
Residual architecture helps: Freezing the backbone while allowing a residual decision head to update preserves expert features while allowing fine-tuning adaptation.
Rollouts are necessary but saturating: A moderate number of rollouts for critic pretraining is optimal; excessive rollouts yield diminishing returns.
Exceptions exist: Environments with high-dimensional observation spaces (e.g., Humanoid) did not benefit from critic pretraining compared to Actor-Only pretraining.

📚 Prerequisite Knowledge

Prerequisites

Proximal Policy Optimization (PPO)
Behavioral Cloning (BC)
Actor-Critic Architecture
Value Function estimation

Key Terms

PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm that improves stability by limiting how much the policy can change in one step

Actor-Critic: An RL architecture with two networks: an Actor that decides which action to take, and a Critic that estimates how good the current state is (value function)

Behavioral Cloning (BC): A form of imitation learning where a model is trained via supervised learning to mimic expert actions given states

Catastrophic Forgetting: A phenomenon where a neural network abruptly loses previously learned knowledge (here, the expert behavior) when training on new data

Rollout: A sequence of interactions (state, action, reward) generated by running a policy in the environment

GAE: Generalized Advantage Estimation—a method to reduce variance in policy gradient estimates

Residual Connection: A skip connection in a neural network that allows gradients/information to bypass intermediate layers; used here to give the decision head direct access to the state while the frozen backbone preserves expert features

PIRL: Pretraining with Imitation and RL fine-tuning—a baseline method where the actor is frozen while the critic is trained, before joint optimization