ACT: Agentic Critical Training—the proposed paradigm where agents learn to discriminate between expert and suboptimal actions via RL
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a sampled group of outputs to stabilize training without a separate critic network
POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Early Experience: A baseline method that executes expert and alternative actions, generates reflection text comparing them, and trains the model to imitate this text
IL: Imitation Learning—training an agent to replicate expert demonstrations using supervised learning (next-token prediction)
RLVR: Reinforcement Learning with Verifiable Rewards—using objective outcomes (like correct/incorrect) to guide RL training
OOD: Out-of-Distribution—tasks or environments that differ significantly from those seen during training (e.g., unseen room layouts in ALFWorld)
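
As an illustration of the GRPO entry above, the group-relative normalization can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation used in the source: the function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example rewards (binary, in the RLVR sense) are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampled group.

    Each advantage is (reward - group mean) / (group std + eps), so the
    group itself serves as the baseline and no critic network is needed.
    The eps term (an assumption) avoids division by zero when every
    output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards (1.0 = correct, 0.0 = incorrect)
# for four outputs sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones. When every output in a group gets the same reward, all advantages are zero and that group contributes no learning signal.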