GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages from the relative rewards of a group of trajectories sampled for the same input
LRM: Large Reasoning Model, an LLM specifically optimized for complex reasoning tasks (e.g., QwQ-32B)
Search Intelligence: The ability to disambiguate queries, generate precise searches, analyze results, and explore thoroughly to resolve conflicts
append-only prompting: A history management technique where all new actions and observations are simply added to the end of the context window
fuzzing: A data synthesis technique where specific details in a seed question are obscured or generalized to increase difficulty and uncertainty
injection: A data synthesis technique where external facts are inserted into a seed question to enrich context and complexity
Pass@4: A metric that checks whether the correct answer appears in at least one of 4 sampled attempts
Avg@4: The average score across 4 samples
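
The GRPO, Pass@4, and Avg@4 entries above each reduce to a small amount of arithmetic, which the following minimal Python sketch illustrates. The function names, the `threshold` parameter, and the `eps` constant are hypothetical and chosen for illustration. The advantage is normalized by the group mean and standard deviation, which is the commonly used GRPO formulation; the definition above says only that advantages are based on relative rewards within a group, so the exact normalization (including the use of the population standard deviation) is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each trajectory relative to its group (same input).

    Assumption: rewards are centered on the group mean and scaled by the
    group (population) standard deviation; eps avoids division by zero
    when every trajectory in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(scores, k=4, threshold=1.0):
    """1.0 if any of the first k sample scores counts as correct, else 0.0.

    Assumption: a sample is "correct" when its score reaches `threshold`.
    """
    return 1.0 if any(s >= threshold for s in scores[:k]) else 0.0


def avg_at_k(scores, k=4):
    """Mean score over the first k samples."""
    return mean(scores[:k])


if __name__ == "__main__":
    # Four trajectories for one input, scored 1 (correct) or 0 (incorrect).
    rewards = [1, 0, 0, 1]
    print(group_relative_advantages(rewards))
    print(pass_at_k([0, 0, 1, 0]), avg_at_k([0, 0, 1, 0]))
```

With one correct sample out of four, Pass@4 credits the question fully while Avg@4 credits it a quarter, which is why the two metrics are usually read together: the first reflects whether the answer is reachable, the second how consistently it is reached.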