
Reinforcement Learning for Flow-Matching Policies

Samuel Pfrommer, Yixiao Huang, Somayeh Sojoudi
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
arXiv (2025)
RL MM Agent

📝 Paper Summary

Robotic Control, Generative Models for Action, Reinforcement Learning
This paper enables flow-matching policies to generate variable-duration trajectories and optimizes them via Group Relative Policy Optimization (GRPO) to significantly outperform suboptimal human demonstrations.
Core Problem
Robotic policies trained via imitation learning inherit the suboptimality and inconsistency of human demonstrators, while existing diffusion-based planners are constrained to inefficient fixed-horizon action chunks.
Why it matters:
  • Human demonstrations are often slow and variable, limiting the ceiling of imitation-based policies
  • Fixed-horizon planning either forces robots to fill surplus time with unnecessary back-and-forth movements or fails outright on tasks that need a longer horizon
  • Standard reinforcement learning for diffusion models is computationally prohibitive due to expensive likelihood estimation (ELBO)
Concrete Example: A robot using a fixed 2-second planning horizon might need only 0.5 seconds to reach a target. Because the planner is forced to output a 2-second trajectory, the robot performs unnecessary 'wiggle' or slow motion to fill the remaining 1.5 seconds, resulting in inefficient control.
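The arithmetic in this example can be checked with a small sketch. The helper name `idle_time` is hypothetical and only illustrates the numbers quoted above; it is not code from the paper.

```python
def idle_time(horizon_s: float, time_needed_s: float) -> float:
    """Seconds of a planned chunk that are not needed to reach the target."""
    return max(0.0, horizon_s - time_needed_s)


# Fixed 2-second horizon, target reachable in 0.5 seconds (numbers from the example).
fixed = idle_time(2.0, 0.5)
wasted_fraction = fixed / 2.0

# A variable-horizon planner could instead choose a horizon equal to the time needed.
variable = idle_time(0.5, 0.5)

print(fixed, wasted_fraction, variable)
```

Under these numbers, three quarters of the fixed-horizon chunk is filler, while a horizon matched to the task leaves no idle time.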
Key Novelty
Variable-Horizon RL for Flow Matching
  • Augments flow-matching models to accept a time horizon input, allowing the policy to dynamically predict and generate trajectories of varying durations rather than a fixed length
  • Adapts Group Relative Policy Optimization (GRPO) to flow-matching policies, using a learned reward surrogate to optimize behavior without expensive value function training or likelihood computation
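The group-relative part of GRPO can be sketched as follows: several trajectories are sampled for the same observation, each is scored, and the scores are standardized within the group so that no separate value function is needed. This is a minimal sketch of the general GRPO advantage computation, not the paper's implementation; the function name, the `eps` constant, and the stand-in reward (negative trajectory duration) are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style).

    Each advantage measures how much better a sample is than the group
    average, in units of the group's standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four trajectories sampled for the same observation,
# scored by a stand-in reward (negative duration in seconds).
durations = [2.0, 1.0, 0.5, 1.5]
rewards = [-d for d in durations]
advantages = group_relative_advantages(rewards)

# The shortest trajectory receives the largest advantage.
best = max(range(len(durations)), key=lambda i: advantages[i])
print(best, advantages)
```

In this toy group, the shortest trajectory is the one a policy update would push toward and the longest is the one it would push away from. How the paper turns these advantages into a flow-matching training signal is not specified in this summary.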
Evaluation Highlights
  • The GRPO approach incurs 50% to 85% less cost (time/actuation) than naive Imitation Learning Flow Matching (ILFM) on simulated unicycle tasks
  • Successfully enables minimum-time control in flow-matching policies, a capability incompatible with standard fixed-horizon Vision-Language-Action (VLA) models
Breakthrough Assessment
7/10
Addresses a critical inefficiency in modern VLA models (fixed horizons) and successfully applies GRPO to continuous flow-matching control. The claimed cost reduction is large (50% to 85%), but the evaluation is limited to simulated unicycle dynamics rather than real-world hardware.