GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and scores each one relative to the group's mean reward, which serves as the baseline in place of a separately learned value function
DPO: Direct Preference Optimization, an alignment method that trains a policy directly on preference pairs to favor chosen responses over rejected ones, without training a separate reward model
SFT: Supervised Fine-Tuning, which trains a model on labeled examples (demonstrations) to teach it a specific format or behavior
Virtual Tools: Text-defined operators (not external APIs) that the model invokes to produce structured intermediate reasoning artifacts (e.g., risk assessment)
Planner: A module that determines the safety strategy (persona, tool subset, topology) based on the input's estimated risk
Responder: The agent that executes the tool trace and generates the final response based on the planner's configuration
Topology: The constrained graph structure defining allowed transitions between virtual tools (e.g., linear, tree, shield)
Over-refusal: A failure mode where a safety-aligned model refuses to answer benign or helpful requests due to over-sensitivity
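The group-relative baseline in the GRPO entry can be sketched as follows. This minimal version assumes one scalar reward per sampled output. It subtracts the group mean and, as many GRPO implementations do, also divides by the group's standard deviation. The function name `group_relative_advantages` and the `eps` guard are illustrative choices, not taken from the source.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Score each output against the other outputs sampled for the same input.

    rewards: one scalar reward per output in the group.
    Returns (r_i - mean) / (std + eps): the group mean acts as the baseline,
    and dividing by the group's std normalizes the advantage scale.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two outputs per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every output got the same reward
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two rewarded outputs get advantages close to +1 and the other two close to -1. If every output in a group earns the same reward, all advantages are zero, and that group contributes no learning signal.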
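The DPO entry's preference objective can be written per pair as the negative log-sigmoid of a scaled margin. That margin compares how much the policy raises the chosen response, relative to a frozen reference model, against how much it raises the rejected one. The sketch below assumes summed response log-probabilities have already been computed. The function name and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each *_logp is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); compute softplus stably
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


print(dpo_loss(-5.0, -20.0, -10.0, -10.0))
```

When the policy and the reference agree on both responses, the margin is zero and the loss is log 2. Raising the chosen response's likelihood relative to the reference lowers the loss. No reward model appears anywhere in the computation.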
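For the SFT entry, a common training objective is the mean negative log-likelihood of the demonstration tokens. This plain-Python sketch assumes per-token log-probabilities are already available. Masking out prompt tokens is a widespread convention, not something the source specifies, and `sft_loss` is a hypothetical name.

```python
def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the demonstrated response tokens.

    token_logprobs: log-probability the model assigned to each target token.
    loss_mask: 1 for tokens of the demonstration, 0 for prompt tokens.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have the same length")
    # Keep only the tokens the model is supposed to learn to produce
    nll = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not nll:
        raise ValueError("loss_mask selects no tokens")
    return sum(nll) / len(nll)


# One prompt token (ignored) followed by two response tokens
print(sft_loss([-1.0, -2.0, -3.0], [0, 1, 1]))
```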
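The planner, responder, and topology entries fit together roughly as follows. The planner maps an estimated risk to a strategy. A responder's tool trace is accepted only if every step uses a tool from the chosen subset and follows an allowed edge of the topology graph. Everything concrete here is hypothetical: the tool names (`assess_risk`, `draft_answer`, `safe_refusal`), the personas, the 0.5 risk threshold, and the two graphs `linear` and `branching`. These are not meant to reproduce the source's linear, tree, or shield topologies.

```python
from dataclasses import dataclass

# Hypothetical topologies: each maps a state to the set of allowed next steps.
TOPOLOGIES = {
    # Low risk: a single drafting step.
    "linear": {
        "START": {"draft_answer"},
        "draft_answer": {"END"},
    },
    # Higher risk: a risk assessment must come first, then the trace branches.
    "branching": {
        "START": {"assess_risk"},
        "assess_risk": {"draft_answer", "safe_refusal"},
        "draft_answer": {"END"},
        "safe_refusal": {"END"},
    },
}


@dataclass
class Strategy:
    persona: str
    tools: frozenset
    topology: str


def plan(risk: float, threshold: float = 0.5) -> Strategy:
    """Hypothetical planner: pick persona, tool subset, and topology from risk in [0, 1]."""
    if risk < threshold:
        return Strategy("helpful assistant", frozenset({"draft_answer"}), "linear")
    return Strategy(
        "careful assistant",
        frozenset({"assess_risk", "draft_answer", "safe_refusal"}),
        "branching",
    )


def trace_is_valid(trace, strategy: Strategy) -> bool:
    """Check a responder's tool trace against the planner's tool subset and topology."""
    graph = TOPOLOGIES[strategy.topology]
    state = "START"
    for tool in trace:
        # Reject tools outside the subset or transitions the graph does not allow
        if tool not in strategy.tools or tool not in graph.get(state, set()):
            return False
        state = tool
    # The trace must end at a step that is allowed to finish
    return "END" in graph.get(state, set())


print(trace_is_valid(["assess_risk", "draft_answer"], plan(0.9)))
```

Keeping the topology as explicit data rather than as prompt text makes the constraint checkable. A trace that skips the risk assessment under a high-risk plan is rejected before any response is returned.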