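PPO: Proximal Policy Optimization, an RL algorithm that keeps each policy update close to the previous policy (typically by clipping the probability ratio between new and old policies, approximating a trust-region constraint) to ensure stability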
Cross-Policy Sampling: A strategy where actions are sampled not just from the current policy but from a diverse pool of policies (including older versions or different models) to encourage exploration
Task Advantage Normalization: Normalizing the advantage values (how much better an action is than expected) specifically within each task's statistics to prevent tasks with large raw rewards from dominating the gradient
Asynchronous Pipeline: A system design where data generation (rollout) and model training happen in parallel processes connected by a buffer, rather than waiting for each other
Advantage: In RL, a value measuring how much better a specific action is than the policy's expected return in that state, commonly written A(s, a) = Q(s, a) - V(s)
On-policy: RL algorithms that strictly require data generated by the *current* version of the model being trained
Off-policy: RL algorithms that can learn from data generated by older or different policies
V-Trace: A correction method used in off-policy RL that adjusts for the difference between the behavior policy (that generated the data) and the target policy (being learned), typically using truncated importance-sampling weights
AutoGLM: A foundation agent framework mentioned as utilizing the AgentRL system
Containerized Environment: Running agent tasks (like web browsing) inside isolated Docker containers to ensure safety and reproducibility
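
To make the Task Advantage Normalization entry concrete, here is a minimal sketch in Python. It assumes that "each task's statistics" means the per-task mean and standard deviation of the advantages in a batch. The function name `normalize_advantages_per_task`, the `eps` parameter, and the sample task names are hypothetical and chosen only for illustration. They are not taken from AgentRL.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def normalize_advantages_per_task(task_ids, advantages, eps=1e-8):
    """Standardize each advantage with the mean/std of its own task.

    Assumption: "task statistics" = per-task mean and population std
    computed over the current batch.
    """
    # Group raw advantages by the task that produced them.
    by_task = defaultdict(list)
    for task, adv in zip(task_ids, advantages):
        by_task[task].append(adv)

    # Per-task (mean, std); eps guards against a zero std.
    stats = {task: (fmean(vals), pstdev(vals)) for task, vals in by_task.items()}

    return [
        (adv - stats[task][0]) / (stats[task][1] + eps)
        for task, adv in zip(task_ids, advantages)
    ]


# Two hypothetical tasks whose raw advantages differ by three orders of magnitude.
task_ids = ["web", "web", "web", "os", "os", "os"]
advantages = [100.0, 200.0, 300.0, 0.1, 0.2, 0.3]
normalized = normalize_advantages_per_task(task_ids, advantages)
```

After normalization, both tasks occupy the same scale, so the task with large raw values no longer dominates the gradient by magnitude alone.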