HTML Pruner: An algorithm that simplifies raw HTML by removing redundant elements and compressing the tree structure while preserving semantic meaning.
Self-Sampling RL: A reinforcement learning approach where the model generates its own training data by attempting tasks; successful attempts become positive examples, and consistently failed attempts become negative examples.
RFT: Rejection Sampling Fine-Tuning, a method in which the model generates multiple reasoning paths and only the correct ones are kept for further supervised training.
Curriculum Learning: A training strategy where the model is trained on progressively harder tasks, moving from simple element recognition to complex multi-step workflows.
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without explicitly training a reward model, used here to discourage failed trajectories.
OCR: Optical Character Recognition—technology to convert images of text into machine-encoded text, used here to identify text elements on webpages.
SFT: Supervised Fine-Tuning—training a model on labeled examples (demonstrations) to establish baseline capabilities.
AutoWebBench: A bilingual (English and Chinese) benchmark dataset constructed by the authors for evaluating real-world web navigation tasks.
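
As an illustration of the RFT and DPO entries above, the following is a minimal Python sketch, not the authors' implementation. The function names, the `is_correct` checker, and the default `beta` value are assumptions made for this example. The first function keeps only the sampled outputs that a checker accepts, which is the filtering step of rejection sampling. The second evaluates the standard DPO loss for one preference pair, given sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math
from typing import Callable, List, Sequence


def rejection_sample(candidates: Sequence[str],
                     is_correct: Callable[[str], bool]) -> List[str]:
    """Keep only the candidates accepted by a (hypothetical) correctness checker.

    The surviving candidates would then be used as supervised training data.
    """
    return [c for c in candidates if is_correct(c)]


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where each term is a sequence log-probability. beta=0.1 is an
    assumed default for illustration, not a value taken from the source.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); use a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference model agree on both responses, the margin is zero and the loss equals log 2. The loss falls below that value as the policy assigns relatively more probability to the chosen (successful) trajectory than the reference does.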