FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

📝 Paper Summary

Robot Learning, Foundation Models in Robotics, Sim-to-Real Transfer

FLaRe fine-tunes large-scale multi-task behavior cloning policies using stable on-policy reinforcement learning in simulation with sparse rewards, significantly improving performance on unseen tasks and real-world robots.

Core Problem

Large-scale behavior cloning (BC) policies suffer from compounding errors and struggle to generalize to unseen states or tasks, leading to unsatisfactory real-world performance despite high training capacity.

Why it matters:

Direct deployment of BC policies hits a performance plateau because models are constrained to expert trajectories and cannot recover effectively from errors.
RL from scratch is sample inefficient and requires difficult-to-scale hand-crafted reward functions.
Prior attempts to fine-tune BC with RL often fail due to destructive gradient updates (policy collapse) when scaling to large networks.

Concrete Example: A robot trained via BC might fail a navigation task if it drifts slightly off the expert path, as it has never seen how to recover. FLaRe allows the robot to learn recovery behaviors through trial-and-error in simulation using only a 'success/fail' signal.

Key Novelty

Stable Large-Scale RL Fine-Tuning of Robotics Foundation Models

Starts with a multi-task transformer policy (SPOC) pre-trained via BC, utilizing its robust representations as a starting point rather than training from scratch.
Performs massive-scale fine-tuning in simulation (ProcTHOR) using only sparse rewards (task completion), bypassing the need for complex reward engineering.
Stabilizes training via specific algorithmic choices: on-policy PPO, very small learning rates, disabling entropy bonuses, and separating actor/critic feature extractors to prevent 'unlearning' pre-trained priors.

Architecture

The FLaRe framework pipeline, illustrating the transition from a pre-trained multi-task BC policy to RL fine-tuning in simulation.

Evaluation Highlights

+23.6 percentage points (absolute) in success rate over state-of-the-art baselines on unseen simulated environments for long-horizon mobile manipulation tasks.
+30.7 percentage points (absolute) over prior best methods in real-world deployment (80.7% success rate vs 50.0%).
Achieves 15x reduction in training time compared to RL-from-scratch baselines like Poliformer while using only sparse rewards.

Breakthrough Assessment

8/10

Strong practical contribution demonstrating that RL fine-tuning can fix the brittleness of foundation models in robotics. The sim-to-real results are impressive, and the method simplifies reward engineering significantly.

⚙️ Technical Details

Problem Definition

Setting: Language-conditioned Partially Observable Markov Decision Process (POMDP)

Inputs: RGB observations and natural language instructions

Outputs: Discrete actions (base movement, arm movement, gripper, END token)

Pipeline Flow

Input Processing (RGB + Language)
Feature Extraction (DinoV2)
Transformer Policy (Actor/Critic)
Action Generation

System Modules

Visual Encoder

Extract visual features from RGB images

Model or implementation: DinoV2 (frozen)

Policy Network (Actor) (Decision Making)

Predict the next action distribution based on history

Model or implementation: Transformer (SPOC architecture, initialized from BC weights)

Value Network (Critic) (Decision Making)

Estimate the value of the current state for RL updates

Model or implementation: Transformer (SPOC architecture, initialized from BC weights)

Novel Architectural Elements

Complete separation of Actor and Critic networks (no shared backbone) to prevent RL gradients from distorting pre-trained representations in the Actor.
Removal of entropy bonus in PPO to prevent 'unlearning' the pre-trained BC policy distribution at the start of fine-tuning.
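The effect of separating the two networks can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: `TinyNet` is a hypothetical stand-in for a transformer, and the weights, features, and learning rate are made-up values. It only shows that when the actor and critic hold separate copies of the pre-trained weights, a value-loss update cannot alter the actor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TinyNet:
    """Hypothetical stand-in for a transformer: one weight vector, linear output."""
    weights: List[float]

    def forward(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))


def value_loss_step(critic: TinyNet, features: List[float],
                    target_return: float, lr: float) -> float:
    """One SGD step on 0.5 * (V(s) - G)^2. Only the critic's weights are updated."""
    error = critic.forward(features) - target_return
    critic.weights = [w - lr * error * x for w, x in zip(critic.weights, features)]
    return 0.5 * error ** 2


# Both networks start from the same (made-up) BC-pretrained weights,
# but each holds its own copy, so there is no shared backbone.
bc_weights = [0.5, -0.2, 0.1]
actor = TinyNet(list(bc_weights))
critic = TinyNet(list(bc_weights))

features = [1.0, 2.0, 3.0]
loss = value_loss_step(critic, features, target_return=1.0, lr=0.01)

print("actor weights unchanged:", actor.weights == bc_weights)
print("critic weights changed: ", critic.weights != bc_weights)
```

With a shared backbone, the same value-loss gradient would flow into the features the actor depends on, which is the distortion the separation is meant to avoid.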

Modeling

Base Model: SPOC (transformer-based policy pre-trained on Objaverse-Populated ProcTHOR)

Training Method: PPO (Proximal Policy Optimization) with specific stabilization modifications

Objective Functions:

Purpose: Maximize expected return (success rate) using sparse rewards.
Formally: Standard PPO clipped surrogate objective.

Purpose: Stabilize updates.
Formally: Entropy bonus term is set to 0 (removed).
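The two objective terms above can be sketched as a single loss function. This is an illustrative sketch of the standard PPO clipped surrogate, not code from the paper; the function name, argument layout, and the clip range of 0.2 are assumptions (0.2 is a common default, and the summary does not state the value used). The entropy coefficient defaults to 0.0 to match the setting described here.

```python
import math
from typing import Sequence


def ppo_policy_loss(new_logp: Sequence[float], old_logp: Sequence[float],
                    advantages: Sequence[float], entropies: Sequence[float],
                    clip_eps: float = 0.2,       # assumed value, not from the paper
                    entropy_coef: float = 0.0) -> float:
    """Negative clipped surrogate, averaged over the batch, minus an entropy bonus."""
    total = 0.0
    for lp_new, lp_old, adv, ent in zip(new_logp, old_logp, advantages, entropies):
        ratio = math.exp(lp_new - lp_old)                       # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)             # pessimistic bound
        total += -surrogate - entropy_coef * ent                # bonus vanishes at coef 0
    return total / len(advantages)


# With entropy_coef = 0.0 the policy's entropy has no influence on the loss.
low_entropy = ppo_policy_loss([0.0], [0.0], [1.0], [0.5])
high_entropy = ppo_policy_loss([0.0], [0.0], [1.0], [2.0])
print(low_entropy == high_entropy)
```

Setting the coefficient to zero means nothing in the loss pushes the policy toward a more uniform action distribution, so the fine-tuned policy starts from, and stays close to, the BC distribution unless the reward signal says otherwise.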

Adaptation: Full fine-tuning of separate Actor and Critic networks

Training Data:

150k procedurally generated ProcTHOR houses
800K+ annotated 3D objects

Key Hyperparameters:

learning_rate: 2e-5
optimizer: Adam
reward_signal: Sparse (1 for success, 0 otherwise)
entropy_coefficient: 0.0
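For readers who want these settings in one place, the listed hyperparameters could be collected in a small config object. This is a hypothetical sketch: the class name, field names, and the `sparse_reward` helper are illustrative and do not come from the FLaRe codebase; only the values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters listed in this summary."""
    learning_rate: float = 2e-5
    optimizer: str = "Adam"
    entropy_coefficient: float = 0.0
    success_reward: float = 1.0   # sparse reward: given on task success
    failure_reward: float = 0.0   # everything else receives no reward


def sparse_reward(task_succeeded: bool, cfg: FineTuneConfig) -> float:
    """Return the reward for one outcome: 1 for success, 0 otherwise."""
    return cfg.success_reward if task_succeeded else cfg.failure_reward


cfg = FineTuneConfig()
print(cfg)
print(sparse_reward(True, cfg), sparse_reward(False, cfg))
```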

Compute: 15x reduction in training time compared to RL from scratch (Poliformer)

Comparison to Prior Work

vs. Poliformer: Fine-tunes a foundation model rather than training from scratch; uses sparse rewards instead of dense.
vs. JSRL: Uses on-policy PPO instead of off-policy RL; finds off-policy unstable for large networks.
vs. PIRLNav: Uses separate actor/critic and no entropy bonus for stability; scales to larger multi-task settings.
vs. RLPD [not cited in paper]: RLPD uses efficient off-policy RL with demonstrations, whereas FLaRe focuses on on-policy fine-tuning of large transformers in simulation.
vs. CalQL [not cited in paper]: CalQL is offline-to-online RL, whereas FLaRe is BC-to-online RL via simulation.

Limitations

Relies on the existence of a high-quality simulation environment (ProcTHOR) for the fine-tuning phase.
Requires a pre-trained foundation model (SPOC) as a starting point; cannot effectively learn entirely unfamiliar tasks from scratch without such a prior.
Computationally intensive due to running large transformer models in simulation loops (mitigated by KV-cache).

Reproducibility

Code: https://robot-flare.github.io

Code and videos are available at the project website. Uses public datasets (ProcTHOR, Objaverse). Specific hyperparameters (LR 2e-5, no entropy bonus) are explicitly detailed.

📊 Experiments & Results

Evaluation Setup

Mobile manipulation in simulated (ProcTHOR) and real-world environments

Benchmarks:

CHORES-S: Household robot tasks (navigation, manipulation, exploration)
Real World Evaluation: Navigation and pickup tasks in a physical office/kitchen [New]

Metrics:

Success Rate (SR)
Episode-Length weighted Success (SEL)
Statistical methodology: Not explicitly reported in the paper
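The two metrics can be computed as in the sketch below. The summary does not give a formula for SEL, so this assumes the form commonly used in embodied-AI benchmarks, where each successful episode is weighted by the ratio of the shortest possible episode length to the agent's actual episode length. That formula, the function names, and the sample episodes are all assumptions made for illustration.

```python
from typing import Sequence


def success_rate(successes: Sequence[int]) -> float:
    """Percentage of episodes that ended in success (1) rather than failure (0)."""
    return 100.0 * sum(successes) / len(successes)


def episode_length_weighted_success(successes: Sequence[int],
                                    shortest_lengths: Sequence[float],
                                    agent_lengths: Sequence[float]) -> float:
    """Assumed SEL form: mean of S_i * w_i / max(w_i, e_i), as a percentage.

    w_i is the shortest possible episode length and e_i is the length of the
    episode the agent actually produced.
    """
    total = sum(s * w / max(w, e)
                for s, w, e in zip(successes, shortest_lengths, agent_lengths))
    return 100.0 * total / len(successes)


# Made-up episodes: three successes (one optimal, two twice as long) and one failure.
successes = [1, 0, 1, 1]
shortest = [10, 10, 20, 30]
actual = [20, 50, 20, 60]
print(success_rate(successes))
print(episode_length_weighted_success(successes, shortest, actual))
```

Under this assumed definition, SEL can never exceed the success rate, and the gap between the two indicates how inefficient the successful episodes were.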

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Simulation results on unseen environments (CHORES-S benchmark) comparing FLaRe against baselines using both sparse (Fair) and dense (Unfair) rewards.
CHORES-S (Average)	Success Rate	51.1	79.5	+28.4
CHORES-S (Average)	Success Rate	55.9	79.5	+23.6
Real-world evaluation results comparing FLaRe to the base SPOC model and best baseline.
Real Robot Tasks (Average)	Success Rate	47.7	80.7	+33.0
Real Robot Tasks (Average)	Success Rate	50.0	80.7	+30.7
Ablation study showing the necessity of stabilization techniques.
ObjectNav	Success Rate	0.0	88.7	+88.7
ObjectNav	Success Rate	0.0	88.7	+88.7
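The Δ column is the difference between the two success rates in percentage points. A quick check of the rows above, using only the numbers in the table (the variable and function names are illustrative):

```python
# (benchmark, baseline success rate, FLaRe success rate, reported delta)
rows = [
    ("CHORES-S (Average)", 51.1, 79.5, 28.4),
    ("CHORES-S (Average)", 55.9, 79.5, 23.6),
    ("Real Robot Tasks (Average)", 47.7, 80.7, 33.0),
    ("Real Robot Tasks (Average)", 50.0, 80.7, 30.7),
    ("ObjectNav", 0.0, 88.7, 88.7),
]


def delta(baseline: float, flare: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal place."""
    return round(flare - baseline, 1)


for name, baseline, flare, reported in rows:
    status = "ok" if delta(baseline, flare) == reported else "MISMATCH"
    print(f"{name}: {flare} - {baseline} = {delta(baseline, flare)} ({status})")
```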

Experiment Figures

Bar charts comparing success rates of FLaRe against baselines (SPOC, Poliformer, etc.) across 4 CHORES tasks in simulation.

Training curves (Success Rate vs. Steps) for FLaRe and baselines.

Main Takeaways

Fine-tuning from multi-task BC models is far more effective than RL from scratch, even when using sparse rewards against dense reward baselines.
Algorithmic stabilization is critical: standard PPO practices (entropy bonus, shared backbones, typical learning rates) lead to catastrophic collapse when fine-tuning large pre-trained transformers.
Sim-to-real transfer is highly effective: a policy fine-tuned entirely in simulation with DinoV2 features and domain randomization transfers to real robots with >80% success rates.
FLaRe enables generalization to new tasks (e.g., relative attribute navigation) and new embodiments (e.g., different action spaces) with minimal fine-tuning time.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, Actor-Critic)
Behavior Cloning (BC)
Transformers in Vision/Robotics
Sim-to-Real Transfer

Key Terms

BC: Behavior Cloning—supervised learning where a policy is trained to mimic expert demonstrations.

PPO: Proximal Policy Optimization—an on-policy reinforcement learning algorithm that stabilizes training by limiting how much the policy can change at each step.

Sparse Reward: A reward signal given only upon task completion (e.g., +1 for success, 0 otherwise), as opposed to dense rewards that give feedback at every step.

Sim-to-Real: The process of transferring a policy trained in a physics simulator to a physical robot, often requiring domain randomization to handle visual/physical discrepancies.

KV-cache: Key-Value cache—a technique to speed up transformer inference by storing and reusing calculations for previous tokens.
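The idea can be shown with a toy single-head attention layer. This is a minimal sketch under stated assumptions, not the paper's implementation: the 2x2 projection matrices and the token values are made up, and `CachedAttention` and `uncached_step` are hypothetical names. The cached version projects only the newest token into a key and value, while the uncached version re-projects the whole history at every step; both return the same output.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Made-up projection matrices for queries, keys, and values.
W_Q: Matrix = [[1.0, 0.0], [0.0, 1.0]]
W_K: Matrix = [[0.5, 0.5], [0.0, 1.0]]
W_V: Matrix = [[1.0, 2.0], [3.0, 4.0]]


def matvec(matrix: Matrix, x: Vector) -> Vector:
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm
            for i in range(len(values[0]))]


class CachedAttention:
    """Stores past keys and values so each step projects only the new token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, token: Vector) -> Vector:
        self.keys.append(matvec(W_K, token))
        self.values.append(matvec(W_V, token))
        return attend(matvec(W_Q, token), self.keys, self.values)


def uncached_step(history: List[Vector]) -> Vector:
    """Same output, but recomputes keys and values for the entire history."""
    keys = [matvec(W_K, t) for t in history]
    values = [matvec(W_V, t) for t in history]
    return attend(matvec(W_Q, history[-1]), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = CachedAttention()
for t in range(len(tokens)):
    cached = cache.step(tokens[t])
    full = uncached_step(tokens[: t + 1])
    print(t, all(abs(a - b) < 1e-12 for a, b in zip(cached, full)))
```

Without the cache, the projection work per step grows with the length of the history; with it, each step adds a constant amount of projection work, which matters when a large transformer is queried at every simulation step.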

SPOC: The specific multi-task mobile manipulation foundation model (based on a transformer architecture) that FLaRe uses as its starting point.

DinoV2: A computer vision foundation model used to extract robust visual features that generalize well between simulation and reality.

ProcTHOR: A framework for procedurally generating diverse simulated 3D environments (houses) for robot training.

Actor-Critic: An RL architecture with two networks: an Actor (decides actions) and a Critic (estimates value of states).

On-policy: RL algorithms (like PPO) that learn strictly from data collected by the current version of the policy, which favors training stability at the cost of sample efficiency.