MCTS: Monte Carlo Tree Search, a search algorithm that builds a tree from sampled rollouts, balancing exploration of rarely tried actions against exploitation of actions with high estimated value to find high-value action sequences
DPO: Direct Preference Optimization—an offline training method that aligns models to preferences (e.g., success > failure) without a separate reward model
Self-Critique: A mechanism where the LLM evaluates its own intermediate outputs (thoughts or actions) to provide feedback/reward signals
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot see the full state of the world
DOM: Document Object Model—the programming interface for HTML documents that represents the page structure as a tree
PlanReAct: An agent architecture that first generates a high-level plan, then executes reasoning and actions step-by-step
RFT: Reinforced Fine-Tuning, a method that keeps only high-quality trajectories and applies supervised fine-tuning to them, discarding negative (failed) data rather than learning from it
Behavior Cloning: Training an agent with supervised learning to imitate the actions taken in expert demonstrations
Process Reward: Feedback given at intermediate steps of a task, rather than just at the final outcome
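
As a rough illustration of two of the terms above, the sketch below shows the exploration/exploitation trade-off in MCTS and the reward-model-free objective in DPO. It assumes UCB1 as the selection rule (one common choice; the source does not say which rule is used) and the standard DPO loss for a single preference pair. The function names, the exploration constant `c`, and the default `beta` are illustrative assumptions, not taken from the source.

```python
import math


def ucb1_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score for one child action in MCTS (assumed selection rule).

    The first term is the mean value so far (exploitation); the second
    is a bonus that shrinks as the action is visited more (exploration).
    """
    if visits == 0:
        return math.inf  # unvisited actions are always tried first
    mean_value = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Computes -log(sigmoid(beta * margin)), where the margin compares how much
    the policy prefers the chosen output over the rejected one, relative to
    a frozen reference model. No separate reward model is involved.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

With equal mean values, `ucb1_score` ranks the less-visited action higher, and `dpo_loss` falls as the policy assigns relatively more probability to the chosen (e.g., successful) trajectory than the reference model does.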