
Can RL Improve Generalization of LLM Agents? An Empirical Study

Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Fudan University, Meituan, Shanghai Artificial Intelligence Laboratory
arXiv (2026)
Agent RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Fine-Tuning (RFT) for Agents, Agent Generalization, Multi-Environment Training
This empirical study finds that Reinforcement Fine-Tuning (RFT) improves agent performance on hard in-domain tasks, but that transfer to unseen environments is limited by interface mismatches. Sequential training across environments can mitigate forgetting.
Core Problem
Most evaluations of Reinforcement Fine-Tuning (RFT) for agents are restricted to in-domain settings, so they do not assess whether improvements generalize to unseen environments with different observations, actions, and background knowledge.
Why it matters:
  • Real-world agents must operate in novel environments without retraining, but current metrics do not distinguish between learning general reasoning skills and overfitting to specific environment dynamics
  • Understanding how RFT affects forgetting and transfer is critical for developing general-purpose agents that can accumulate skills across diverse platforms
Concrete Example: An agent trained on BabyAI (which provides valid action lists at every step) becomes dependent on this guidance. When transferred to WebShop (which offers sparse feedback and no action lists), its score drops from 28.59 (base) to 10.25 because it cannot generate valid actions without such prompts.
Key Novelty
Systematic 3-Axis Evaluation of RFT Generalization
  • Evaluates RFT agents along three distinct axes: (1) generalization across task difficulty within the same environment, (2) zero-shot transfer to completely unseen environments, and (3) sequential training dynamics.
  • Identifies that RFT transfer is highly sensitive to the similarity of action spaces and to feedback density (e.g., dense vs. sparse rewards), not only to semantic reasoning capability.
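The three axes above can be sketched as a small evaluation plan. This is only an illustrative layout, not the authors' harness: `EvalCase`, `build_cases`, and the `:easy` / `:hard` split labels are hypothetical names, and the environment list simply reuses the environments mentioned in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One (training recipe, test target) pair. Hypothetical structure."""
    axis: str                 # "difficulty", "transfer", or "sequential"
    train: Tuple[str, ...]    # ordered training stages
    test: str                 # environment (or split) evaluated afterwards


def build_cases(envs: List[str], train_env: str) -> List[EvalCase]:
    """Enumerate evaluation cases for one starting environment."""
    others = [e for e in envs if e != train_env]
    cases = []
    # Axis 1: train on easy tasks, test on hard tasks of the same environment.
    cases.append(EvalCase("difficulty", (f"{train_env}:easy",), f"{train_env}:hard"))
    # Axis 2: train on one environment, test zero-shot on each unseen one.
    for env in others:
        cases.append(EvalCase("transfer", (train_env,), env))
    # Axis 3: train on A then B, then test on both to measure gain and forgetting.
    for env in others:
        for target in (train_env, env):
            cases.append(EvalCase("sequential", (train_env, env), target))
    return cases


# Environment names taken from this summary; the full set used in the paper may differ.
envs = ["WebShop", "BabyAI", "AlfWorld", "TextCraft"]
cases = build_cases(envs, "WebShop")
for case in cases:
    print(case.axis, " -> ".join(case.train), "| test:", case.test)
```

Keeping the training stages as an ordered tuple makes the sequential axis explicit: the same pair of environments in a different order is a different case.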
Evaluation Highlights
  • +60.1-point improvement on Hard WebShop tasks using Qwen2.5-7B-Instruct when trained only on Easy tasks, demonstrating strong intra-environment generalization.
  • +78.62-point improvement on AlfWorld (Held-In) using Qwen2.5-3B-Instruct, but only a +4.91-point average improvement on unseen environments (Held-Out), showing the gap between environment-specific and general skills.
  • Sequential training on WebShop and then TextCraft (7B model) achieves 82.50 on TextCraft (vs. 80.88 single-task) while retaining 86.32 on WebShop (vs. 86.50), a drop of only 0.18 points that indicates catastrophic forgetting is largely avoided.
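As a quick arithmetic check on the highlights above, the sketch below recomputes the sequential-training deltas and a rough held-out-to-held-in gain ratio from the reported scores. The variable and function names are illustrative, and the ratio is an assumption of this summary rather than a metric defined in the paper (it compares a single held-in environment with an average over held-out ones).

```python
# Scores copied from the highlights above (7B model, WebShop then TextCraft).
reported = {
    "webshop": {"single_task": 86.50, "sequential": 86.32},
    "textcraft": {"single_task": 80.88, "sequential": 82.50},
}


def delta(env: str) -> float:
    """Sequential score minus single-task score, rounded to two decimals."""
    scores = reported[env]
    return round(scores["sequential"] - scores["single_task"], 2)


# Forgetting on the first environment after training on the second.
forgetting_webshop = -delta("webshop")
# Change on the second environment relative to training on it alone.
gain_textcraft = delta("textcraft")

# Held-in vs. held-out gains (3B model, trained on AlfWorld).
heldin_gain, heldout_gain = 78.62, 4.91
transfer_ratio = round(heldout_gain / heldin_gain, 3)

print(f"WebShop forgetting: {forgetting_webshop:.2f} points")
print(f"TextCraft gain vs. single-task: {gain_textcraft:+.2f} points")
print(f"Held-out gain as a fraction of held-in gain: {transfer_ratio}")
```

On these numbers, the held-out gain is roughly 6% of the held-in gain, while sequential training costs under 0.2 points on the earlier environment.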
Breakthrough Assessment
7/10
A comprehensive empirical study that challenges the assumption that RFT naturally leads to generalizable agents. It provides useful insight into how transfer and forgetting behave in agentic RFT.