RLVR: Reinforcement Learning with Verifiable Rewards, which uses objective, automatically checkable outcomes (like passing unit tests) as the reward signal for RL training
Issue-free: A training setup where the natural language problem description is removed, forcing the agent to identify the bug using only the provided codebase and failing test cases
Entropy-aware clipping: Modifying the PPO trust region size based on the model's predictive uncertainty (entropy); high entropy allows larger updates (exploration), low entropy enforces smaller updates (stability)
ReAct: Reason+Act—a paradigm where the model alternates between generating a thought (reasoning trace) and executing an action (tool call)
SWE-bench Verified: A human-validated subset of the SWE-bench dataset of real-world GitHub issues and pull requests, filtered for quality and reproducibility
RLOO: REINFORCE Leave-One-Out, a variance-reduction technique for policy gradient methods in which the baseline for each sample is the mean reward of the other samples in the batch
TTS: Test-Time Scaling—generating multiple candidate solutions at inference time and selecting the best one (often via voting or test execution)
PPO: Proximal Policy Optimization, an RL algorithm that clips the ratio between new and old policy probabilities so each update stays within an approximate 'trust region', preventing destructively large updates and training instability
SFT: Supervised Fine-Tuning—training the model on expert demonstration trajectories before applying RL
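
To make the RLOO, PPO, and entropy-aware clipping entries above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular paper: the function names, the linear entropy-to-epsilon schedule, and the default bounds `eps_min=0.1` and `eps_max=0.3` are all hypothetical choices made for this example.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


def entropy_scaled_epsilon(entropy, max_entropy, eps_min=0.1, eps_max=0.3):
    """Hypothetical schedule: interpolate the clip range linearly in entropy.

    High entropy gives a wider clip range (larger updates allowed);
    low entropy gives a narrower one (smaller, more stable updates).
    """
    frac = min(max(entropy / max_entropy, 0.0), 1.0)
    return eps_min + frac * (eps_max - eps_min)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Verifiable rewards for three rollouts (1.0 = tests pass, 0.0 = fail).
    advantages = rloo_advantages([1.0, 0.0, 0.0])
    # Entropy of a uniform binary distribution is log(2), the maximum here.
    eps = entropy_scaled_epsilon(entropy=math.log(2), max_entropy=math.log(2))
    print(advantages, eps, clipped_surrogate(1.5, advantages[0], eps))
```

With binary pass/fail rewards, the leave-one-out baseline gives a passing rollout a positive advantage whenever some of its peers fail, without needing a learned value function. The entropy schedule only changes how far the probability ratio may move before the clipped objective stops rewarding further change.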