Self-Improving Embodied Foundation Models

📝 Paper Summary

Robotics Foundation Models, Reinforcement Learning for Robotics, Self-Improving Systems

This paper enables robots to teach themselves new skills by first learning to predict how close they are to a goal, then using that prediction as a reward signal to practice without human help.

Core Problem

Training robots currently relies heavily on imitating humans (behavioral cloning), which requires massive amounts of expensive demonstration data and limits robots to only copying what they have seen.

Why it matters:

Imitation learning is data-inefficient; slight improvements require exponentially more human demonstrations
Manually designing a reward function for every possible real-world robot task is an untenable engineering effort
Current methods struggle to generalize behaviors to new skills beyond the exact scenarios in the training data

Concrete Example: In the LanguageTable task, simply increasing human imitation data by 8x only improves success from 45% to 60%. The robot struggles to adjust when it fails or encounters a scenario slightly different from the human demos.

Key Novelty

Two-Stage Self-Improvement via Steps-to-Go

Stage 1 (Supervised): The robot learns to copy human actions AND predict how many steps remain until the goal is reached (steps-to-go)
Stage 2 (Self-Improvement): The robot uses its own 'steps-to-go' prediction as a reward signal. If an action reduces the estimated steps remaining, it gets a positive reward, allowing it to practice autonomously.
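
The Stage 2 reward idea can be illustrated with a minimal sketch. It assumes the reward is simply the drop in predicted steps-to-go between consecutive observations; the function name and the `scale` argument are hypothetical, and `scale` is only a stand-in for a reward-scaling constant, since this summary does not state how such a constant is applied.

```python
def steps_to_go_reward(d_curr: float, d_next: float, scale: float = 1.0) -> float:
    """Reward is positive when the predicted steps-to-go shrinks after acting.

    d_curr: predicted steps remaining before the action (assumed model output).
    d_next: predicted steps remaining after the action.
    scale:  hypothetical scaling constant, not taken from the paper.
    """
    return scale * (d_curr - d_next)


# The model estimated 12 steps remaining before the action and 10 after.
progress = steps_to_go_reward(12.0, 10.0)  # positive: moved toward the goal
# The estimate grew from 10 to 13 steps, so the action is penalized.
regress = steps_to_go_reward(10.0, 13.0)   # negative: moved away from the goal
```

Because the reward depends only on the model's own predictions, no hand-written task reward or human label is needed during practice.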

Architecture

Pseudocode for the Stage 2 Self-Improvement loop.

Evaluation Highlights

Self-Improvement with just 10% additional autonomous practice improves success rates from 45% to 75% on real-world LanguageTable, compared to 8x more human data yielding only 60%
Achieves ~87-88% success rate on real-world LanguageTable tasks using only 20% of the original imitation dataset plus self-improvement
Demonstrates ability to acquire novel skills not present in the imitation data, generalizing beyond semantic changes to behavioral changes

Breakthrough Assessment

9/10

Significantly outperforms scaling human data (the dominant paradigm) by using autonomous self-improvement. Solves the reward engineering bottleneck by learning rewards from data, enabling scalable real-world robot learning.

⚙️ Technical Details

Problem Definition

Setting: Post-training Embodied Foundation Models (EFMs) via Online Reinforcement Learning

Inputs: Observation o_t (image + text instruction), current policy pi

Outputs: Action a_t (tokenized robot control command)

Pipeline Flow

Input Processing (Image + Text)
EFM Policy Backbone
Action Decoding

System Modules

Vision-Language Encoder

Encodes current observation (images) and goal (text instruction) into embeddings

Model or implementation: PaLI-3B (Pretrained Vision-Language Model)

Policy Head / Decoder

Autoregressively predicts action tokens based on embeddings

Model or implementation: PaLI Decoder

Action Detokenizer

Converts token sequence into continuous robot control commands

Model or implementation: RT-2 style detokenization
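
As a rough illustration of what detokenization involves, the sketch below maps integer bin indices back to continuous values. The 256 uniform bins, the [-1, 1] range, and the function name are assumptions made for this example; the summary only says the scheme is "RT-2 style" and does not give these details.

```python
from typing import List, Sequence


def detokenize_action(
    bin_ids: Sequence[int],
    low: float = -1.0,
    high: float = 1.0,
    num_bins: int = 256,
) -> List[float]:
    """Map each bin index to the center of its uniform bin over [low, high]."""
    width = (high - low) / num_bins
    values = []
    for b in bin_ids:
        if not 0 <= b < num_bins:
            raise ValueError(f"bin index {b} outside [0, {num_bins})")
        values.append(low + (b + 0.5) * width)
    return values


# One bin index per action dimension (hypothetical 3-dimensional command).
command = detokenize_action([0, 128, 255])
```

Using bin centers rather than bin edges keeps the decoded value within half a bin width of any value that was quantized into that bin.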

Novel Architectural Elements

Dual-head objective integration: The same EFM architecture predicts both actions (for control) and steps-to-go (for self-improvement reward calculation) during training, though inference only requires the action head.

Modeling

Base Model: PaLI-3B

Training Method: Two-stage: 1) Supervised Fine-Tuning (SFT) with Behavioral Cloning + Steps-to-go prediction, 2) Online Reinforcement Learning (Self-Improvement) using REINFORCE

Objective Functions:

Purpose: Mimic human actions in dataset.

Formally: L_BC: Maximize log-likelihood of dataset action a_t conditioned on observation o_t and goal g

Purpose: Learn to predict distance to goal for use as reward later.

Formally: L_steps-to-go: Minimize difference between predicted steps d(o, g) and actual steps t'-t

Purpose: Improve policy using self-generated rewards.

Formally: L_RL (REINFORCE): Maximize expected return R_t = sum(gamma^(i-t) * r(...)) where r is derived from improvement in steps-to-go prediction
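
The REINFORCE objective can be made concrete with a small sketch that computes the discounted return R_t for one trajectory and the usual surrogate loss, -sum_t R_t * log pi(a_t | o_t, g). It uses plain floats for clarity; in a real training loop the log-probabilities would be differentiable outputs of the policy. The function names are hypothetical, and details such as baselines or return normalization are omitted because this summary does not describe them.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.9) -> List[float]:
    """Compute R_t = sum_{i >= t} gamma^(i - t) * r_i in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(
    log_probs: Sequence[float],
    rewards: Sequence[float],
    gamma: float = 0.9,
) -> float:
    """Surrogate loss whose gradient w.r.t. the policy is the REINFORCE estimate."""
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    returns = discounted_returns(rewards, gamma)
    return -sum(r * lp for r, lp in zip(returns, log_probs))


# Three-step trajectory with unit rewards and gamma = 0.9.
example_returns = discounted_returns([1.0, 1.0, 1.0])
```

Minimizing this loss raises the log-probability of actions that were followed by high return, which here means actions that reduced the predicted steps-to-go.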

Adaptation: Full fine-tuning of PaLI-3B

Trainable Parameters: 3 Billion

Training Data:

LanguageTable dataset (simulated and real)
Aloha dataset (simulated and real)

Key Hyperparameters:

gamma (discount factor): 0.9
c (reward scaling): 5e-2
rl_algorithm: REINFORCE
policy_updates_per_round: Not explicitly reported in the paper
batch_size: sampled minibatches from replay buffer

Compute: Not reported in the paper

Comparison to Prior Work

vs. RT-2: Adds a second stage of Online RL using self-computed rewards, rather than stopping at Supervised Fine-Tuning
vs. RLHF: Uses 'steps-to-go' as a dense reward proxy instead of a learned reward model trained on human preferences
vs. Standard RL: Learns reward function from data (steps-to-go) rather than requiring manual reward engineering; leverages pre-trained Foundation Model priors

Limitations

Relies on the assumption that the steps-to-go prediction is accurate enough to serve as a reward signal
Requires an initial imitation dataset to bootstrap the steps-to-go predictor
On-policy REINFORCE can be sample-inefficient compared to off-policy methods (though the paper claims high efficiency relative to BC baselines)
Success detection heuristic relies on model stability (very small steps-to-go prediction)
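
The success detection heuristic mentioned in the last limitation can be sketched as a simple threshold on the predicted steps-to-go. The threshold value and function name below are hypothetical; the summary states only that success is inferred from a very small steps-to-go prediction.

```python
from typing import Optional, Sequence


def first_success_step(
    predicted_steps_to_go: Sequence[float],
    threshold: float = 3.0,  # hypothetical cutoff, not reported in this summary
) -> Optional[int]:
    """Return the first timestep whose prediction falls at or below the threshold."""
    for t, d in enumerate(predicted_steps_to_go):
        if d <= threshold:
            return t
    return None


# Predictions shrink as the robot approaches the goal; success is flagged at t = 3.
flagged_at = first_success_step([20.0, 12.0, 5.0, 2.5, 1.0])
# The prediction never gets small enough, so no success is flagged.
never_flagged = first_success_step([20.0, 15.0])
```

This also shows why the heuristic inherits the predictor's errors: an overconfident small prediction ends the episode early, while a noisy large one can miss a genuine success.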

Reproducibility

Code: https://self-improving-efms.github.io

Project website (self-improving-efms.github.io) provides a self-contained Colab notebook for a pointmass navigation toy example. Real-world robot code/weights are not explicitly linked in the text. The specific PaLI-3B model weights are proprietary/internal to Google.

📊 Experiments & Results

Evaluation Setup

Robotic manipulation tasks in Simulation and Real World

Benchmarks:

LanguageTable (Simulated): language-conditioned manipulation (pushing blocks)
LanguageTable (Real World): language-conditioned manipulation (pushing blocks)
Aloha (Simulated & Real): bimanual manipulation (insertion)

Metrics:

Success Rate
Statistical methodology: 3 random seeds used for simulated experiments to validate reliability

Key Results

Simulation results on LanguageTable showing that Self-Improvement (Stage 2) consistently improves over Behavioral Cloning (Stage 1) and is highly sample efficient.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
LanguageTable (Sim) - 10% Dataset	Success Rate	20.0	58.0	+38.0
LanguageTable (Sim) - 80% Dataset	Success Rate	45.0	78.0	+33.0

Real-world results on LanguageTable confirming simulation findings: Self-Improvement significantly boosts performance with minimal extra robot time.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
LanguageTable (Real) - 20% Dataset	Success Rate	63.0	88.0	+25.0
LanguageTable (Real) - 80% Dataset	Success Rate	62.0	87.0	+25.0

Sample efficiency comparison: Self-Improvement vs. Scaling Human Data.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
LanguageTable	Success Rate	60.0	75.0	+15.0

Experiment Figures

Success rates on LanguageTable (Sim and Real) comparing Stage 1 (BC) vs Stage 2 (Self-Improvement) across different dataset sizes.

Trajectory visualizations on a Pointmass toy domain.

Main Takeaways

Self-Improvement is significantly more sample-efficient than collecting more human demonstrations: 10% extra autonomous practice outperforms 8x more human data.
The method works in the real world with minimal human supervision (one operator for multiple robots) because it learns its own success detectors and rewards.
The combination of Pretraining + Self-Improvement is critical; ablations show pretraining enables the sample efficiency.
Unlocks 'Behavioral Generalization': robots can learn skills (e.g., separating blocks) that were not present in the original imitation dataset, going beyond simple semantic generalization.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (Policy Gradients, Value Functions)
Imitation Learning / Behavioral Cloning (BC)
Vision-Language Models (VLMs) / Foundation Models

Key Terms

EFM: Embodied Foundation Model—a large pretrained model (like a VLM) fine-tuned to output robot actions

SFT: Supervised Fine-Tuning—training a model on labeled examples (here, human demonstrations) before applying reinforcement learning

Steps-to-go: A predicted scalar value estimating the number of timesteps remaining until a goal is achieved from the current state

REINFORCE: A basic policy gradient algorithm in Reinforcement Learning that updates policies based on the return (total reward) of a trajectory

Monte Carlo returns: The actual sum of rewards received from a specific time step until the end of an episode, used to estimate the value of a state-action pair

Behavioral Cloning: A supervised learning approach where a robot learns a policy by strictly mimicking expert (human) demonstrations

PaLI: Pathways Language and Image model—a large vision-language model architecture used as the backbone for the robot policy

RT-2: Robotic Transformer 2—a specific method for turning VLMs into robot policies by tokenizing actions as text

On-policy: An RL setting where the data used for training comes from the current version of the policy being optimized, rather than historical data

Deadly Triad: The instability caused in RL when combining Function Approximation, Bootstrapping (using estimates to update estimates), and Off-Policy learning