
Agentic Critical Training

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
University of Maryland, College Park
arXiv (2026)
Agent RL Reasoning

📝 Paper Summary

Self-evolving Agentic reasoning RL-based Agent Training
Agentic Critical Training (ACT) replaces the imitation of reflection text with an RL-based objective where agents must correctly judge the superior action among alternatives, forcing the autonomous development of critical reasoning.
Core Problem
Imitation Learning (IL) teaches agents what to do but not why, leaving them brittle in suboptimal states; current 'reflection' methods merely train agents to imitate pre-generated critique text rather than developing genuine reasoning capabilities.
Why it matters:
  • Agents trained via IL cannot recover from failures because they never observe suboptimal actions or understand the causal link between actions and outcomes
  • Approaches that treat reflection as a supervised learning task (imitating text) fail to internalize the reasoning process, leading to superficial 'thoughts' that don't improve decision-making
  • Without true critical reasoning, agents struggle to generalize to out-of-distribution tasks where memorized expert trajectories do not apply
Concrete Example: In ALFWorld, when an action fails and the environment returns 'Nothing happens,' a standard IL agent enters an infinite loop repeating the same failed command. In contrast, an ACT-trained agent analyzes the failure via internal reasoning, diagnoses the issue, and issues a correct alternative command.
Key Novelty
Reinforcement Learning for Action Discrimination (ACT)
  • Construct 'preference pairs' at each step containing one expert action and one suboptimal model-generated action
  • Train the agent via RL, using Group Relative Policy Optimization (GRPO), to identify which action is better, rewarding only the correct judgment
  • Force the model to autonomously generate Chain-of-Thought reasoning to maximize the judgment reward, rather than supervising it to copy a teacher's reasoning trace
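The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PreferencePair` type, the `<answer>` tag format, the random A/B ordering, and all function names are hypothetical. The only parts taken from the summary are that each pair holds one expert action and one model-generated action, and that reward is given only for a correct judgment. The advantage function shows the standard group-relative normalisation used by GRPO.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class PreferencePair:
    state: str         # observation/history shown to the agent
    option_a: str
    option_b: str
    expert_label: str  # "A" or "B": which option holds the expert action


def build_pair(state, expert_action, model_action, rng):
    """Pair the expert action with a model-generated alternative.

    Assumption: the order is shuffled so position does not reveal the answer.
    """
    if rng.random() < 0.5:
        return PreferencePair(state, expert_action, model_action, "A")
    return PreferencePair(state, model_action, expert_action, "B")


# Hypothetical output format: free-form reasoning, then <answer>A</answer>.
ANSWER_RE = re.compile(r"<answer>\s*([AB])\s*</answer>")


def judgment_reward(completion, pair):
    """Return 1.0 only if the final answer picks the expert action.

    The reasoning text itself is never scored or compared to a teacher trace.
    """
    answers = ANSWER_RE.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1] == pair.expert_label else 0.0


def group_relative_advantages(rewards):
    """Normalise rewards within one group of samples: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # all samples agree, so there is no learning signal
    return [(r - mean) / std for r in rewards]
```

Because the reward depends only on the final choice, any reasoning the model writes before the answer is shaped indirectly, through whether it leads to the right judgment.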
Architecture
Figure 1: Illustration of the Agentic Critical Training (ACT) paradigm compared to Imitation Learning.
Evaluation Highlights
  • Achieves an average improvement of 5.07 points over Imitation Learning and 4.62 points over Reinforcement Learning across three benchmarks (ALFWorld, WebShop, ScienceWorld)
  • Outperforms 'Early Experience' (a baseline that uses supervised learning to imitate reflection text) by 2.42 points on average, validating the RL-based critique approach
  • Improves GPQA-Diamond accuracy to 53.37% (+1.85pp over CoT prompting), whereas Imitation Learning causes a collapse to 44.61%, demonstrating transfer to general reasoning without domain-specific data
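The GPQA-Diamond figures above imply a few numbers that are not stated directly. The short calculation below derives them, assuming the +1.85pp gain is measured against the same base model with CoT prompting. The derived values are arithmetic consequences of the reported figures, not results quoted from the paper, and the variable names are illustrative.

```python
# Reported in the summary (accuracy in %, gain in percentage points).
act_acc = 53.37
act_gain_over_cot = 1.85
il_acc = 44.61

# Derived values (not stated in the summary).
cot_acc = round(act_acc - act_gain_over_cot, 2)  # implied CoT-prompting baseline
il_drop_vs_cot = round(cot_acc - il_acc, 2)      # how far IL falls below that baseline
act_vs_il = round(act_acc - il_acc, 2)           # gap between ACT and IL

print(f"Implied CoT baseline: {cot_acc:.2f}%")
print(f"IL drop vs. CoT:      {il_drop_vs_cot:.2f}pp")
print(f"ACT gain vs. IL:      {act_vs_il:.2f}pp")
```

Under that assumption, Imitation Learning lands below the untuned CoT baseline, while ACT lands above it.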
Breakthrough Assessment
8/10
Strong methodological shift from imitating reflection text to learning critical reasoning via RL. Significant empirical gains across agentic and general-reasoning benchmarks suggest that training in agentic environments can transfer to general reasoning.