SFT: Supervised Fine-Tuning—training a model on high-quality examples of inputs and desired outputs
DPO: Direct Preference Optimization—an algorithm that aligns a model to human preferences by training directly on chosen/rejected response pairs, without training a separate reward model
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled response relative to the average reward of a group of responses to the same prompt, removing the need for a separate value (critic) model
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a GRPO-style RL algorithm, used here in the production pipeline for reasoning
RLVR: Reinforcement Learning with Verifiable Rewards—training models on tasks where the final answer can be programmatically checked (e.g., math, code)
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer
RAG: Retrieval-Augmented Generation—fetching relevant external data to ground the model's generation
SymPy: A Python library for symbolic mathematics, used here by the Executor agent to perform exact calculations
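DPO's objective can be sketched on a single preference pair. The function name, argument names, and the β = 0.1 default below are illustrative, not from the source:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: push the policy to widen the
    log-probability margin of the chosen response over the rejected one,
    measured relative to a frozen reference model (no reward model)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy already favors the chosen response more strongly than the reference does, the margin is positive and the loss is small; flipping the preference raises the loss.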
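The group-relative comparison at the heart of GRPO can be sketched as a few lines; the helper name is hypothetical:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each response's reward against the
    mean and standard deviation of its group (all responses sampled for
    the same prompt), replacing a learned value/critic model."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]
```

Responses above the group mean get positive advantages and are reinforced; those below are penalized, so only relative quality within the group matters.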
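A toy RLVR-style reward check, assuming an exact-match task; real pipelines use task-specific verifiers (math answer parsers, code test suites) rather than this illustrative string comparison:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """RLVR reward: programmatically check the model's final answer
    instead of learning a reward model. Here: normalized exact match."""
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0
```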
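A brief illustration of the kind of exact calculation SymPy enables (the Executor agent's actual calls are not shown in the source):

```python
import sympy as sp

# Symbolic integration with an exact result, no floating-point rounding:
x = sp.symbols("x")
antideriv = sp.integrate(x * sp.exp(x), x)          # (x - 1)*exp(x)
assert sp.simplify(antideriv - (x - 1) * sp.exp(x)) == 0

# Exact rational arithmetic where binary floats would drift:
assert sp.Rational(1, 3) + sp.Rational(1, 6) == sp.Rational(1, 2)
```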