SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality instruction-response pairs
GRPO: Group Relative Policy Optimization—an on-policy RL algorithm that normalizes rewards within a group of outputs for the same query, removing the need for a separate value function
DPO: Direct Preference Optimization—an off-policy method that optimizes a model to prefer chosen responses over rejected ones without an explicit reward model
ORPO: Odds Ratio Preference Optimization—a method combining SFT and preference optimization into a single objective to eliminate the reference model
CWE: Common Weakness Enumeration—a community-developed list of software and hardware weakness types
CPG: Code Property Graph—a graph representation of code combining abstract syntax trees, control flow graphs, and program dependence graphs
Joern: An open-source static analysis platform used to generate Code Property Graphs (CPGs) for extracting code context
Rationalization: A data curation method where a teacher model generates reasoning while having access to the ground truth answer (prone to shortcuts)
Rejection Sampling: A data curation method where a teacher model generates multiple responses without access to the ground truth, and only responses with correct answers are kept (filters out hallucinated reasoning)
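To make the GRPO entry concrete, here is a minimal sketch of the group-relative advantage computation; the function name, epsilon, and reward values are illustrative, not from the source:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of outputs sampled for one query.

    GRPO uses the group mean as a baseline (and the group std for scaling),
    so no separate value function is needed:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled outputs for the same query, two rewarded, two not:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct outputs in the group receive positive advantages and incorrect ones negative, and the advantages sum to zero by construction.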
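A per-pair sketch of the DPO objective, assuming the arguments are summed token log-probabilities of full responses under the policy and the frozen reference model (argument names are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:

    loss = -log sigmoid( beta * [ (log pi_c - log ref_c)
                                - (log pi_r - log ref_r) ] )
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps this numerically stable
    return math.log1p(math.exp(-margin))
```

At a zero margin the loss is log 2; it falls as the policy's preference for the chosen response grows relative to the reference, which is why no explicit reward model is needed.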
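A toy sketch of the rejection-sampling filter described above; the `generate` and `is_correct` callables are hypothetical stand-ins for a teacher model and an answer checker:

```python
import itertools

def rejection_sample(prompt, generate, is_correct, n=8):
    """Sample n teacher responses (no ground truth in the prompt) and keep
    only the ones whose final answer verifies as correct."""
    return [c for c in (generate(prompt) for _ in range(n)) if is_correct(c)]

# Deterministic stand-in for a stochastic teacher, for illustration only:
fake = itertools.cycle(["answer: 4", "answer: 5"])
kept = rejection_sample("What is 2+2?", lambda p: next(fake),
                        lambda c: c.endswith("4"), n=4)
# kept == ["answer: 4", "answer: 4"]
```

Because the teacher never sees the answer, any kept response reached the correct result on its own, which is what rules out the shortcuts that rationalization is prone to.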