Next-state signal: The immediate feedback following an agent's action, such as a user reply, tool execution result, or GUI state change
PRM: Process Reward Model—a model that evaluates intermediate steps (actions) rather than just the final outcome
OPD: Hindsight-Guided On-Policy Distillation—a method where the model learns from a 'teacher' version of itself that has been augmented with a textual hint from the future (next state)
RLVR: Reinforcement Learning with Verifiable Rewards—RL applied to tasks where the outcome can be programmatically checked (e.g., code compilation)
SWE: Software Engineering—referring here to agents that perform coding tasks
PPO: Proximal Policy Optimization—an RL algorithm that updates policies using a clipped objective to ensure stability
Binary RL: A method converting evaluative signals into simple +1/-1 scalar rewards
SGLang: A serving framework for Large Language Models used here for efficient inference
Megatron: A framework for large-scale model training
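
To make the Binary RL and PPO entries above concrete, here is a minimal sketch of how a +1/-1 reward could feed a clipped PPO term for a single action. The function names, the clip range eps=0.2, and the use of the raw reward as the advantage are illustrative assumptions, not details taken from the source.

```python
import math


def binary_reward(is_positive: bool) -> float:
    """Map an evaluative signal to a scalar reward of +1 or -1."""
    return 1.0 if is_positive else -1.0


def ppo_clipped_term(logp_new: float, logp_old: float,
                     advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for one action (to be maximized).

    eps=0.2 is a commonly used value, assumed here for illustration.
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any gain from moving the ratio past the clip range.
    return min(ratio * advantage, clipped * advantage)


# Hypothetical usage: the binary reward stands in for the advantage.
adv = binary_reward(True)
term = ppo_clipped_term(logp_new=math.log(2.0), logp_old=0.0, advantage=adv)
```

In this usage the probability ratio is about 2, so the term is capped at (1 + eps) times the advantage. With a negative advantage the unclipped product is the smaller value, so the minimum keeps it and the penalty is not capped.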