Agentic Critical Training

📝 Paper Summary

Self-evolving, Agentic reasoning, RL-based Agent Training

Agentic Critical Training (ACT) replaces the imitation of reflection text with an RL-based objective where agents must correctly judge the superior action among alternatives, forcing the autonomous development of critical reasoning.

Core Problem

Imitation Learning (IL) teaches agents what to do but not why, leaving them brittle in suboptimal states; current 'reflection' methods merely train agents to imitate pre-generated critique text rather than developing genuine reasoning capabilities.

Why it matters:

Agents trained via IL cannot recover from failures because they never observe suboptimal actions or understand the causal link between actions and outcomes
Approaches that treat reflection as a supervised learning task (imitating text) fail to internalize the reasoning process, leading to superficial 'thoughts' that don't improve decision-making
Without true critical reasoning, agents struggle to generalize to out-of-distribution tasks where memorized expert trajectories do not apply

Concrete Example: In ALFWorld, when an action fails and the environment returns 'Nothing happens,' a standard IL agent enters an infinite loop repeating the same failed command. In contrast, an ACT-trained agent analyzes the failure via internal reasoning, diagnoses the issue, and issues a correct alternative command.

Key Novelty

Reinforcement Learning for Action Discrimination (ACT)

Construct 'preference pairs' at each step containing one expert action and one suboptimal model-generated action
Train the agent via RL (GRPO) to identify which action is better, rewarding only the correct judgment
Force the model to autonomously generate Chain-of-Thought reasoning to maximize the judgment reward, rather than supervising it to copy a teacher's reasoning trace
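The three steps above can be sketched as a judgment example plus a binary reward. This is a minimal illustration, not the paper's implementation: the prompt wording, the `Answer: A/B` convention, the random placement of the expert action, and the names `build_judgment_example` and `judgment_reward` are all assumptions introduced here.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class JudgmentExample:
    prompt: str
    correct_label: str  # "A" or "B": the slot holding the expert action


def build_judgment_example(context: str, expert_action: str,
                           alternative_action: str,
                           rng: random.Random) -> JudgmentExample:
    """Build one A-vs-B judgment prompt from a preference pair.

    Assumption: the expert action is placed in a random slot so that
    position alone carries no signal about which action is better.
    """
    expert_first = rng.random() < 0.5
    if expert_first:
        a, b = expert_action, alternative_action
    else:
        a, b = alternative_action, expert_action
    prompt = (
        f"{context}\n\n"
        f"Candidate A: {a}\n"
        f"Candidate B: {b}\n"
        "Reason step by step, then finish with 'Answer: A' or 'Answer: B'."
    )
    return JudgmentExample(prompt, "A" if expert_first else "B")


def judgment_reward(response: str, correct_label: str) -> float:
    """Reward only the final judgment; the reasoning text is never scored."""
    answers = re.findall(r"Answer:\s*([AB])\b", response)
    if not answers:
        return 0.0  # no parsable judgment
    return 1.0 if answers[-1] == correct_label else 0.0
```

Because the reward depends only on the final label, any chain of thought the model writes before `Answer:` is shaped indirectly by whether it leads to the right judgment, which is the mechanism the third step describes.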

Architecture

Illustration of the Agentic Critical Training (ACT) paradigm compared to Imitation Learning.

Evaluation Highlights

Achieves an average improvement of 5.07 points over Imitation Learning and 4.62 points over Reinforcement Learning across three benchmarks (ALFWorld, WebShop, ScienceWorld)
Outperforms 'Early Experience' (a baseline that uses supervised learning to imitate reflection text) by 2.42 points on average, validating the RL-based critique approach
Improves GPQA-Diamond accuracy to 53.37% (+1.85pp over CoT prompting), whereas Imitation Learning causes a collapse to 44.61%, demonstrating transfer to general reasoning without domain-specific data

Breakthrough Assessment

8/10

Strong methodological shift from imitating reflection text to learning critical reasoning via RL. Reports consistent empirical gains across agentic and general reasoning benchmarks, suggesting that training in agentic environments can transfer to general reasoning.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) formulated as a sequential decision-making task

Inputs: Context history s_t containing task description, recent observation-action pairs, and current observation

Outputs: Action a_t (or a selection between two candidate actions during ACT phase)
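One way to picture the context history s_t is as a small container that renders the task, a window of recent observation-action pairs, and the current observation into a single prompt string. The class below is a hypothetical sketch: the name `AgentContext`, the field names, the text layout, and the window size `max_history` are assumptions, not details from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    task: str
    observation: str = ""
    # Past (observation, action) pairs, oldest first.
    history: list[tuple[str, str]] = field(default_factory=list)
    max_history: int = 3  # hypothetical window of recent pairs to keep

    def render(self) -> str:
        """Serialize s_t: task, recent pairs, then the current observation."""
        recent = self.history[-self.max_history:] if self.max_history > 0 else []
        lines = [f"Task: {self.task}"]
        for obs, act in recent:
            lines.append(f"Observation: {obs}")
            lines.append(f"Action: {act}")
        lines.append(f"Observation: {self.observation}")
        return "\n".join(lines)

    def step(self, action: str, next_observation: str) -> None:
        """Record the action taken and move to the next observation."""
        self.history.append((self.observation, action))
        self.observation = next_observation
```

The agent only ever sees this rendered text rather than the full environment state, which is what makes the setting partially observable.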

Pipeline Flow

Data Construction (Pairing expert vs. model actions)
Stage 1: Agentic Critical Training (RL on discrimination)
Stage 2: RL Action Training (RL on generation)

System Modules

Data Constructor

Create preference pairs for training

Model or implementation: Initial Policy π_θ0

Critic Agent (ACT Phase)

Learn to reason about action quality

Model or implementation: Qwen3-8B (updated via GRPO)

Actor Agent (Action Phase)

Generate actions for the task

Model or implementation: Qwen3-8B (initialized from ACT phase, updated via GRPO)

Novel Architectural Elements

Two-stage RL pipeline where the first stage optimizes a 'discrimination' objective (A vs B) to internalize critical reasoning, which is then transferred to the 'generation' objective

Modeling

Base Model: Qwen3-8B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Reward correct identification of expert action (ACT Phase).

Formally: R_acc = 1 if selection matches expert, else 0.

Purpose: Reward generation of expert action (Action Phase).

Formally: R_acc = 1 if a == a_expert, else 0.

Purpose: Encourage valid formatting and admissible actions.

Formally: R = R_acc + R_adm + R_fmt, where R_adm = 0.1 if the action is valid (admissible) and R_fmt = -0.5 if the format is invalid.
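The Action Phase reward can be written out as a short function. The three constants come from the summary above; everything else is an assumption made for illustration: the function name `action_reward`, the `well_formed` flag standing in for whatever format check the paper uses, and the choice that a malformed response earns only the format penalty.

```python
from typing import Collection, Optional

R_ACC = 1.0   # action matches the expert action
R_ADM = 0.1   # action is admissible in the current state
R_FMT = -0.5  # response has an invalid format


def action_reward(action: Optional[str], expert_action: str,
                  admissible_actions: Collection[str],
                  well_formed: bool) -> float:
    """Combine accuracy, admissibility, and format terms into one scalar."""
    # Assumption: if no action can be parsed, the other terms cannot apply.
    if not well_formed or action is None:
        return R_FMT
    reward = 0.0
    if action == expert_action:
        reward += R_ACC
    if action in admissible_actions:
        reward += R_ADM
    return reward
```

Under these assumptions an admissible but non-expert action still earns the small admissibility bonus, so the policy gets a weak signal for staying within valid commands even when it misses the expert choice.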

Training Data:

Expert trajectories from ALFWorld, WebShop, ScienceWorld
ACT data created by sampling K alternatives from base policy and pairing with expert actions
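The ACT data construction described above can be sketched as a loop over expert steps. This is an illustration under stated assumptions: `sample_action` is a hypothetical stand-in for a call to the base policy, and dropping samples that duplicate each other or coincide with the expert action is a choice made here, not something the summary specifies.

```python
from typing import Callable


def build_preference_pairs(trajectory: list[tuple[str, str]],
                           sample_action: Callable[[str], str],
                           k: int) -> list[dict[str, str]]:
    """Pair each expert action with up to k distinct sampled alternatives.

    trajectory: (context, expert_action) tuples from one expert episode.
    sample_action: hypothetical wrapper that draws one action from the
        base policy given the context text.
    """
    pairs = []
    for context, expert_action in trajectory:
        seen = set()
        for _ in range(k):
            candidate = sample_action(context).strip()
            # Skip samples that give no contrast with the expert action.
            if candidate == expert_action or candidate in seen:
                continue
            seen.add(candidate)
            pairs.append({
                "context": context,
                "expert": expert_action,
                "alternative": candidate,
            })
    return pairs
```

Note that a step where every sample matches the expert action yields no pairs, so the resulting dataset naturally concentrates on states where the base policy disagrees with the expert.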

Compute: Not reported in the paper

Comparison to Prior Work

vs. Reflexion: ACT trains the capability into the model parameters via RL, avoiding test-time dependency on complex prompting frameworks
vs. Early Experience: ACT uses RL to reward the *outcome* of reasoning (correct judgment) rather than imitating the *text* of reasoning, preventing the model from learning superficial patterns
vs. DeepSeek-R1: ACT adds a 'discriminative' pre-training stage (comparing expert vs. alternative actions) designed for agents, rather than relying only on outcome-based RL on the final answer

Limitations

Relies on the assumption that expert actions are always superior to model-generated alternatives during data construction
Requires collecting alternative actions from a policy to construct contrastive pairs, which adds computational cost (though Table 2 suggests data reuse is possible)
Performance depends on the quality of the underlying expert trajectories (Imitation Learning bound)

Reproducibility

Code: https://attention-is-all-i-need.github.io/ACT/

Project page available at https://attention-is-all-i-need.github.io/ACT/. The paper provides prompt templates and reward values (R_acc=1, R_adm=0.1, R_fmt=-0.5). GRPO hyperparameters (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Sequential decision-making in text-based environments

Benchmarks:

ALFWorld: embodied household tasks (text-based)
WebShop: web navigation and shopping
ScienceWorld: scientific experimentation and reasoning
GPQA-Diamond: general scientific reasoning (out-of-domain evaluation)
MATH-500: mathematical reasoning (out-of-domain evaluation)

Metrics:

Success Rate
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results: ACT improves on the Imitation Learning, Reinforcement Learning, and Early Experience baselines across agentic benchmarks. Only the average gains are given in the text; the underlying aggregate scores are not reported as single numbers.

Benchmark	Metric	Baseline	This Paper	Δ
Average (ALFWorld, WebShop, ScienceWorld)	Success Rate	Imitation Learning (not reported)	ACT (not reported)	+5.07
Average (ALFWorld, WebShop, ScienceWorld)	Success Rate	Reinforcement Learning (not reported)	ACT (not reported)	+4.62
Average (ALFWorld, WebShop, ScienceWorld)	Success Rate	Early Experience (not reported)	ACT (not reported)	+2.42

Out-of-distribution (OOD) results: on a general reasoning benchmark, ACT improves accuracy without domain-specific training data, whereas Imitation Learning degrades it.

Benchmark	Metric	Baseline	This Paper	Δ
GPQA-Diamond	Accuracy (%)	CoT prompting: 51.52	53.37	+1.85
GPQA-Diamond	Accuracy (%)	Imitation Learning: 44.61	53.37	+8.76

Main Takeaways

ACT consistently improves performance over both Imitation Learning and standard Reinforcement Learning across embodied, web, and scientific agent benchmarks.
Directly training models to judge action quality via RL (ACT) is superior to training them to imitate reflection text (Early Experience), leading to better generalization.
ACT enables 'self-verification' behaviors (e.g., checking answers in math problems) that emerge naturally from the training objective, even without domain-specific reasoning data.
Unlike Imitation Learning, which degrades general reasoning capabilities (e.g., on GPQA), ACT preserves and even enhances general reasoning performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (policy, reward, advantage)
Imitation Learning (IL) / Behavior Cloning
Chain-of-Thought (CoT) prompting

Key Terms

ACT: Agentic Critical Training—the proposed paradigm where agents learn to discriminate between expert and suboptimal actions via RL

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a sampled group of outputs to stabilize training without a separate critic network

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

Early Experience: A baseline method that executes expert and alternative actions, generates reflection text comparing them, and trains the model to imitate this text

IL: Imitation Learning—training an agent to replicate expert demonstrations using supervised learning (next-token prediction)

RLVR: Reinforcement Learning with Verifiable Rewards—using objective outcomes (like correct/incorrect) to guide RL training

OOD: Out-of-Distribution—tasks or environments that differ significantly from those seen during training (e.g., unseen room layouts in ALFWorld)
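The GRPO entry above describes normalizing advantages within a sampled group instead of learning a separate critic. A minimal sketch of that step follows, assuming the common form in which each reward is centered on the group mean and scaled by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions; implementations differ on these details, and the paper's exact variant is not stated here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Normalize rewards of responses sampled for the same prompt.

    Each advantage measures how much better a response did than its
    siblings, so no learned value network is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary judgment rewards such as `[1.0, 0.0, 0.0, 1.0]`, correct responses receive advantages near +1 and incorrect ones near -1. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.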