GRPO: Group Relative Policy Optimization, an RL algorithm that scores each sampled response relative to the other responses sampled for the same prompt, using the group's rewards as the baseline instead of a learned critic network
ARTIST: Agentic Reasoning and Tool Integration in Self-Improving Transformers—the proposed framework for training agentic LLMs via RL
Chain of Thought (CoT): A prompting/reasoning technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training models on labeled datasets of inputs and target outputs
SymPy: A Python library for symbolic mathematics, used here as a tool for the model
BFCL: Berkeley Function Calling Leaderboard—a benchmark for evaluating LLM tool-use capabilities
Tau-bench: A benchmark for evaluating agents in dynamic, multi-turn scenarios
Loss Masking: A training technique where loss is calculated only on specific tokens (e.g., model reasoning) and ignored on others (e.g., deterministic tool outputs)
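
A minimal sketch of two mechanisms defined above: the group-relative baseline from the GRPO entry and the token-level filter from the Loss Masking entry. It is illustrative only and is not the framework's implementation. The function names, the sample rewards, the per-token loss values, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against its own group (hypothetical helper).

    The group's mean reward acts as the baseline, so no critic network
    is needed. Dividing by the group's standard deviation is a common
    normalization; using the population std here is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_mean_loss(token_losses, loss_mask):
    """Average per-token losses over positions where the mask is 1.

    Positions with mask 0 (for example, tokens copied from a tool's
    output) contribute nothing to the result.
    """
    kept = [loss for loss, keep in zip(token_losses, loss_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Four responses sampled for one prompt, with made-up 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Made-up per-token losses for one response; positions 2 and 3 stand in
# for tool output tokens and are masked out.
token_losses = [0.5, 0.7, 2.0, 3.0, 0.3]
loss_mask = [1, 1, 0, 0, 1]
loss = masked_mean_loss(token_losses, loss_mask)

print(advantages)
print(loss)
```

Responses that beat the group mean receive positive advantages and the rest receive negative ones. The two large losses on the masked positions do not affect the final value.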