Evaluation Setup
A toy GridWorld navigation task and continuous control tasks in MuJoCo physics simulation
Benchmarks:
- GridWorld (Toy navigation)
- MuJoCo Playground (Continuous control, locomotion)
- Humanoid Control (High-dimensional continuous control)
Metrics:
- Cumulative Reward
- Statistical methodology: Not explicitly reported in the paper
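As a rough illustration of the Cumulative Reward metric, the sketch below sums per-step rewards within an episode and averages across episodes. It assumes an undiscounted sum, which the summary does not specify. The function names and the reward traces are hypothetical and are not results from the paper.

```python
from statistics import mean


def cumulative_reward(rewards):
    """Undiscounted sum of per-step rewards for one episode (assumed definition)."""
    return sum(rewards)


def mean_cumulative_reward(episodes):
    """Average cumulative reward over a list of episodes."""
    return mean(cumulative_reward(ep) for ep in episodes)


# Hypothetical reward traces, for illustration only.
episodes = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
]

per_episode = [cumulative_reward(ep) for ep in episodes]
average = mean_cumulative_reward(episodes)
```

Since no statistical methodology is reported, any spread across seeds or episodes would have to be computed separately from per-episode values like `per_episode`.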
Main Takeaways
- FPO successfully trains flow-based policies from scratch, avoiding the need for behavior cloning initialization typical in diffusion policy works.
- Flow-based policies learn multimodal action distributions in ambiguous states (GridWorld), whereas Gaussian policies collapse to a single (often suboptimal) mean.
- In under-conditioned humanoid control (root-only commands), FPO learns viable walking behaviors where Gaussian policies struggle, which points to the greater expressivity of flow-based policies.
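The multimodality point can be made concrete with a toy sketch. It is not FPO training and not the paper's GridWorld: it only fits a single Gaussian by moment matching to action samples from a hypothetical ambiguous state with two equally good actions, then checks how often samples land near either one. The modes, noise level, and tolerance are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)

# Hypothetical ambiguous state: two equally good actions, near -1.0 and +1.0.
MODES = (-1.0, 1.0)
actions = [rng.choice(MODES) + rng.gauss(0.0, 0.05) for _ in range(2000)]

# A single Gaussian fitted by moment matching has one mean and one width.
mu = mean(actions)
sigma = pstdev(actions)


def near_a_mode(a, tol=0.25):
    """True if action a lies within tol of either good action."""
    return any(abs(a - m) <= tol for m in MODES)


def hit_rate(samples):
    """Fraction of samples that land near a good action."""
    return sum(1 for a in samples if near_a_mode(a)) / len(samples)


# Samples drawn from the fitted unimodal Gaussian.
gauss_hit_rate = hit_rate([rng.gauss(mu, sigma) for _ in range(2000)])

# The bimodal samples stand in for the output of a more expressive policy.
bimodal_hit_rate = hit_rate(actions)

print(f"fitted mean={mu:.3f}, std={sigma:.3f}")
print(f"Gaussian hit rate={gauss_hit_rate:.3f}, bimodal hit rate={bimodal_hit_rate:.3f}")
```

In this toy setting the fitted mean sits between the two good actions, so the Gaussian either spreads mass over poor actions or, if narrowed, must commit to one point. A policy that can represent both modes avoids that trade-off.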