Can RL Improve Generalization of LLM Agents? An Empirical Study

📝 Paper Summary

Reinforcement Fine-Tuning (RFT) for Agents, Agent Generalization, Multi-Environment Training

This empirical study reveals that while Reinforcement Fine-Tuning improves agent performance on hard in-domain tasks, transfer to unseen environments is limited by interface mismatches, though sequential training can mitigate forgetting.

Core Problem

Most evaluations of Reinforcement Fine-Tuning (RFT) for agents are restricted to in-domain settings, failing to assess whether improvements generalize to unseen environments with different observations, actions, and background knowledge.

Why it matters:

Real-world agents must operate in novel environments without retraining, but current metrics do not distinguish learning general reasoning skills from overfitting to the dynamics of a specific environment.
Understanding how RFT affects forgetting and transfer is critical for developing general-purpose agents that can accumulate skills across diverse platforms.

Concrete Example: An agent trained on BabyAI (which provides valid action lists at every step) becomes dependent on this guidance; when transferred to WebShop (which offers sparse feedback and no action lists), its performance drops from 28.59 (base) to 10.25 because it cannot generate valid actions without prompts.

Key Novelty

Systematic 3-Axis Evaluation of RFT Generalization

Evaluates RFT agents along three distinct axes: (1) generalization across task difficulty within the same environment, (2) zero-shot transfer to completely unseen environments, and (3) sequential training dynamics.
Identifies that RFT transfer is highly sensitive to the similarity of action spaces and feedback density (e.g., dense vs. sparse rewards), rather than just semantic reasoning capabilities.

Evaluation Highlights

+60.1 points improvement on Hard WebShop tasks using Qwen2.5-7B-Instruct when trained only on Easy tasks, demonstrating strong intra-environment generalization.
+78.62 points improvement on AlfWorld (Held-In) using Qwen2.5-3B-Instruct, but only +4.91 average improvement on unseen environments (Held-Out), showing the gap between specific and general skills.
Sequential training of WebShop then TextCraft (7B model) achieves 82.50 on TextCraft (vs 80.88 single-task) while retaining 86.32 on WebShop (vs 86.50), effectively preventing catastrophic forgetting.

Breakthrough Assessment

7/10

A comprehensive empirical study that challenges the assumption that RFT naturally leads to generalizable agents. It provides crucial insights into the mechanics of transfer and forgetting in agentic RFT.

⚙️ Technical Details

Problem Definition

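Setting: Multi-turn decision making formulated as a tuple (U, S, A, O, T, R) of instruction space, state space, action space, observation space, transition function, and reward function, where agents interact with the environment to maximize cumulative reward.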

Inputs: Instruction u and interaction history state s_t = (a_0, o_0, ..., a_{t-1}, o_{t-1})

Outputs: Action a_t (including reasoning trace and environment command)

Pipeline Flow

Agent receives Instruction + History
Agent generates Reasoning + Action
Environment executes Action -> returns Observation + Reward
Agent receives new Observation -> repeats until done
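
The four steps above can be sketched as a plain interaction loop. This is a minimal illustration only: `ToyEnv`, `scripted_policy`, and `run_episode` are hypothetical names that do not come from the paper or from AgentGym, and the scripted policy stands in for the LLM (which would emit a reasoning trace plus a command).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # [(a_0, o_0), (a_1, o_1), ...], i.e. the state s_t


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator: the task succeeds on 'open door'."""
    max_turns: int = 5
    turns: int = 0

    def reset(self) -> str:
        """Start a new episode and return the task instruction u."""
        self.turns = 0
        return "Leave the room."

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Execute one action and return (observation, reward, done)."""
        self.turns += 1
        if action == "open door":
            return "The door is open.", 1.0, True
        return "Nothing happens.", 0.0, self.turns >= self.max_turns


def scripted_policy(instruction: str, history: History) -> str:
    """Placeholder for the LLM policy: look around first, then open the door."""
    return "look" if not history else "open door"


def run_episode(env: ToyEnv, policy: Callable[[str, History], str]) -> Tuple[float, History]:
    """Roll out one episode: instruction + history -> action -> observation + reward."""
    instruction = env.reset()
    history: History = []
    total_reward, done = 0.0, False
    while not done:
        action = policy(instruction, history)          # agent acts on u and s_t
        observation, reward, done = env.step(action)   # environment responds
        history.append((action, observation))          # s_{t+1} extends s_t
        total_reward += reward
    return total_reward, history


total_reward, history = run_episode(ToyEnv(), scripted_policy)
```

In this toy run the episode ends after two turns with a cumulative reward of 1.0; a real environment would replace `ToyEnv` and a model call would replace `scripted_policy`.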

System Modules

Policy Agent

Generate reasoning traces and actions based on current history

Model or implementation: Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct

Environment

Execute action and provide feedback

Model or implementation: Simulator (WebShop, AlfWorld, etc.)

Modeling

Base Model: Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct

Training Method: Reinforcement Fine-Tuning (RFT) using GRPO

Objective Functions:

Purpose: Optimize policy to maximize expected cumulative reward.

Formally: Gradient ascent on J(theta) = E[R(tau)], where tau is a trajectory sampled from the current policy, using GRPO-estimated gradients.
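
As a rough illustration of how GRPO replaces a learned value baseline, the sketch below standardizes the rewards of several episodes sampled for the same prompt. The function name is hypothetical, and the use of the population standard deviation and the `eps` stabilizer are assumptions; implementations differ on both details, and the paper's exact variant is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize episode rewards within one group (same prompt).

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average episodes get positive credit and below-average ones
    negative credit, with no learned value function involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight episodes for one prompt, two of which succeeded.
group_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)
```

One consequence worth noting: if every episode in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.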

Adaptation: Full fine-tuning

Trainable Parameters: All model parameters

Training Data:

Training/Test splits from AgentGym
Easy/Hard splits based on Qwen2.5-7B-Instruct avg@8 performance
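
A difficulty split of this kind could be derived from per-task base-model scores roughly as follows. The threshold value, the task identifiers, and the function name are all hypothetical; the summary does not state the cutoff the authors used.

```python
from typing import Dict, List, Tuple


def split_by_difficulty(
    base_avg_at_8: Dict[str, float], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Partition tasks into (easy, hard) by the base model's avg@8 score.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    easy = sorted(t for t, s in base_avg_at_8.items() if s >= threshold)
    hard = sorted(t for t, s in base_avg_at_8.items() if s < threshold)
    return easy, hard


# Made-up scores for three made-up task ids.
scores = {"task_001": 0.875, "task_002": 0.125, "task_003": 0.5}
easy_tasks, hard_tasks = split_by_difficulty(scores)
```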

Key Hyperparameters:

learning_rate: 1e-6 (3B), 5e-7 (7B)
beta_kl: 0.04
clip_ratio: 0.1
global_batch_size: 16
episodes_per_prompt: 8
max_response_length: 8192
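
To show where clip_ratio and beta_kl enter, here is a per-token sketch of a clipped surrogate with a KL penalty, using the values listed above as defaults. It follows the commonly published GRPO formulation (PPO-style clipping plus a non-negative KL estimate against a frozen reference policy); whether the paper's implementation matches this form exactly is an assumption, and `grpo_token_loss` is a hypothetical name.

```python
import math

CLIP_RATIO = 0.1  # clip_ratio from the hyperparameter list
BETA_KL = 0.04    # beta_kl from the hyperparameter list


def grpo_token_loss(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_ratio: float = CLIP_RATIO,
    beta_kl: float = BETA_KL,
) -> float:
    """Loss for one token: negative clipped surrogate plus a KL penalty."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio vs. the sampling policy
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)  # pessimistic bound
    # Non-negative KL estimate against the reference policy (assumed form).
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return -(surrogate - beta_kl * kl)


# The new policy raised this token's log-probability by 0.5 and the advantage
# is positive, so the ratio (about 1.65) is clipped to 1.1.
loss = grpo_token_loss(logp_new=-1.0, logp_old=-1.5, logp_ref=-1.5, advantage=1.0)
```

With a clip ratio of 0.1, a single update cannot claim more than a 10% increase in a token's probability ratio as extra credit, which keeps each policy step small.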

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RLHF: Applies RL to multi-turn agentic trajectories rather than single-turn responses
vs. Previous Agent RFT work (e.g., Liu et al., 2023): Evaluates cross-environment generalization and sequential transfer, whereas prior work focuses on in-domain performance
vs. Agent-FLAN [not cited in paper]: Agent-FLAN focuses on data mixing for SFT, while this paper investigates RL dynamics across environments

Limitations

Evaluation limited to five specific environments; results may vary for other domains like coding or math agents.
Failure mode analysis relies on GPT-5-mini labeling, which may introduce its own biases.
Study focuses on Qwen2.5 models; generalization to other model families (Llama, Mistral) is not explicitly tested.

Reproducibility

Code: https://github.com/woooodyy/AgentGym-RL

Data statistics and action spaces are detailed in Appendix B. The paper does not explicitly mention releasing model checkpoints.

📊 Experiments & Results

Evaluation Setup

Multi-turn agent interaction across 5 environments (WebShop, SearchQA, TextCraft, AlfWorld, BabyAI).

Benchmarks:

WebShop (Web navigation / e-commerce)
SearchQA (Search-augmented QA)
TextCraft (Minecraft crafting game)
AlfWorld (Embodied household tasks)
BabyAI (Gridworld navigation)

Metrics:

avg@8 (Success Rate)
Average Interaction Turns
Average Generated Tokens
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Intra-environment experiments show RFT generalizes well from easy tasks to hard tasks within the same domain.
WebShop (Hard Tasks)	avg@8	17.4	77.5	+60.1
AlfWorld (Hard Tasks)	avg@8	14.5	49.4	+34.9
Inter-environment experiments reveal transfer is possible but highly asymmetric and environment-dependent.
AlfWorld (Held-In)	avg@8	13.19	91.81	+78.62
Average Held-Out Envs	avg@8	29.28	34.19	+4.91
WebShop (Held-Out, agent trained on BabyAI)	avg@8	28.59	10.25	-18.34
Sequential training experiments demonstrate that agents can learn new tasks without forgetting old ones.
TextCraft (Downstream)	avg@8	80.88	82.50	+1.62
WebShop (Upstream)	avg@8	86.50	86.32	-0.18
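
The Δ column is simply "This Paper" minus "Baseline". As a quick arithmetic check, the snippet below recomputes each delta from the values in the table and compares it with the reported figure; the helper names are hypothetical.

```python
# (row label, baseline avg@8, this-paper avg@8, reported delta), copied from the table
ROWS = [
    ("WebShop (Hard Tasks)", 17.4, 77.5, 60.1),
    ("AlfWorld (Hard Tasks)", 14.5, 49.4, 34.9),
    ("AlfWorld (Held-In)", 13.19, 91.81, 78.62),
    ("Average Held-Out Envs", 29.28, 34.19, 4.91),
    ("WebShop", 28.59, 10.25, -18.34),  # the held-out WebShop row
    ("TextCraft (Downstream)", 80.88, 82.50, 1.62),
    ("WebShop (Upstream)", 86.50, 86.32, -0.18),
]


def recompute_deltas(rows):
    """Return {label: this_paper - baseline}, rounded to two decimals."""
    return {label: round(ours - base, 2) for label, base, ours, _ in rows}


def mismatched_rows(rows, tol=1e-9):
    """List labels whose recomputed delta differs from the reported one."""
    deltas = recompute_deltas(rows)
    return [label for label, _, _, reported in rows
            if abs(deltas[label] - reported) > tol]


deltas = recompute_deltas(ROWS)
inconsistent = mismatched_rows(ROWS)
```

For the sequential-training rows, the baseline is the single-task result, so the -0.18 on upstream WebShop measures how much performance was lost after the subsequent TextCraft training.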

Main Takeaways

RFT significantly improves interaction efficiency (reducing steps and tokens) alongside success rate within the same environment.
Generalization to unseen environments is strongly correlated with the similarity of action spaces and feedback mechanisms; transfer from sparse-reward environments (SearchQA) is better than from dense-reward ones (BabyAI).
Sequential training allows agents to accumulate capabilities across environments with minimal forgetting, often matching or exceeding single-task performance.
Failure mode analysis shows 'Confirmation Bias' (overconfidence without verification) is a persistent error pattern across all trained agents.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient)
LLM Agent Frameworks (ReAct)
Curriculum Learning

Key Terms

RFT: Reinforcement Fine-Tuning—training an LLM agent using reinforcement learning (like PPO or GRPO) on interaction trajectories to optimize task completion rewards

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group averages of multiple sampled trajectories for the same input to reduce variance

ReAct: Reason+Act—a prompting paradigm where the agent generates a reasoning trace (thought) before generating the actual action to execute

Held-In: Environments or tasks that were seen during the training phase

Held-Out: Environments or tasks that were NOT seen during training, used to test generalization

Curriculum Learning: Training on easier tasks first before moving to harder tasks to improve convergence and final performance

avg@8: The mean success rate (or mean reward) over 8 trajectories sampled per task instruction, averaged across tasks. Unlike pass@8, a task does not count as solved merely because one of the 8 trajectories succeeds.
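
To make the metric concrete, the sketch below computes avg@k next to pass@k on made-up binary outcomes, assuming avg@k means the per-task mean over k sampled trajectories averaged across tasks. The function names and task ids are hypothetical.

```python
from typing import Dict, List


def avg_at_k(outcomes: Dict[str, List[int]]) -> float:
    """Mean over tasks of the mean success across that task's k samples."""
    per_task = [sum(samples) / len(samples) for samples in outcomes.values()]
    return sum(per_task) / len(per_task)


def pass_at_k(outcomes: Dict[str, List[int]]) -> float:
    """Fraction of tasks where at least one of the k samples succeeded."""
    per_task = [1.0 if any(s == 1 for s in samples) else 0.0
                for samples in outcomes.values()]
    return sum(per_task) / len(per_task)


# Two tasks, 8 sampled trajectories each (1 = success, 0 = failure).
outcomes = {
    "task_a": [1, 0, 0, 0, 0, 0, 0, 0],
    "task_b": [1, 1, 1, 1, 0, 0, 0, 0],
}
avg8 = avg_at_k(outcomes)    # (0.125 + 0.5) / 2 = 0.3125
pass8 = pass_at_k(outcomes)  # both tasks have at least one success: 1.0
```

The gap between the two numbers shows why avg@8 is the stricter measure: it rewards reliability across samples rather than a single lucky trajectory.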