Early Experience: A training paradigm where agents learn from the future states generated by their own actions, using them as supervision without external rewards.
Implicit World Modeling: Training the policy to predict the next state (token sequence) given a current state and action, helping it internalize environment dynamics.
Self-Reflection: A method where the agent compares its own sampled action to an expert action, using the observed outcomes to generate a natural language explanation of why the expert choice was better.
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that scores each sampled response relative to the others in its group; used here to test whether early experience provides a better starting point for RL.
SFT: Supervised Fine-Tuning, i.e. training a model on expert demonstrations (in the agent setting, a form of imitation learning).
Rollout: A sequence of interactions generated by the agent acting in the environment.
DOM: Document Object Model, the tree-structured representation of a webpage, used in web navigation tasks.
Chain-of-Thought: Intermediate reasoning steps generated by the model before producing the final action.
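
To make two of the terms above concrete, here is a minimal Python sketch under stated assumptions, not the authors' implementation. The names Transition, world_model_example, and group_relative_advantages are hypothetical, and so is the prompt template. The first function turns one (state, action, next state) transition into a next-state prediction example, as in Implicit World Modeling. The second shows the group-relative reward normalization commonly associated with GRPO; it assumes the population standard deviation, and implementations differ on that detail.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass
class Transition:
    """One step of a rollout (hypothetical container for illustration)."""
    state: str       # textual observation, e.g. a serialized DOM
    action: str      # action the agent took in that state
    next_state: str  # observation the environment returned afterwards


def world_model_example(t: Transition) -> Dict[str, str]:
    """Build a (prompt, target) pair whose target is the observed next state.

    The prompt layout is an assumption; only the idea of predicting the
    next state from state and action comes from the glossary entry.
    """
    prompt = f"State:\n{t.state}\nAction:\n{t.action}\nNext state:\n"
    return {"prompt": prompt, "target": t.next_state}


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards [1.0, 0.0] for two sampled responses, the first gets a positive advantage and the second a negative one of equal size; if every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.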