CPT: Continual Pre-training—further training a base model on domain-specific raw text
IT: Instruction Tuning (or SFT)—training on (instruction, response) pairs to learn task execution
PA: Preference Alignment—optimizing the model toward outputs that humans or automated judges prefer, typically via RLHF or DPO
DPO: Direct Preference Optimization—a stable method for aligning a model to preferences directly from preference pairs, without explicitly training a separate reward model
GenRM: Generative Reward Model—an LLM prompted to evaluate or correct responses rather than outputting a scalar score
SCP: Stepwise Corrective Preference—constructing preference data by finding the first error in a reasoning chain and using a model-generated correction as the 'winner'
FAP: Final Answer Preference—constructing preference data based on the correctness of the final answer (outcome reward)
FinCap: The set of four identified core capabilities: Domain Concepts, Domain Tasks, Reasoning, and Instruction Following
CFA: Chartered Financial Analyst—a rigorous professional certification in finance, used here as a source of complex reasoning tasks
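For readers who want the objective behind the DPO entry above, the standard DPO loss (notation follows the original DPO paper; this document may use different symbols) is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred ("winner") and dispreferred ("loser") responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ controls the strength of the implicit KL constraint.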
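The SCP and FAP entries describe two ways of constructing (chosen, rejected) preference pairs. A minimal sketch of the contrast is below; the function names, field names, and inputs are illustrative assumptions, not this work's actual pipeline.

```python
# Hedged sketch: two ways to build DPO-style preference pairs.
# All names (fap_pair, scp_pair, dict keys) are illustrative placeholders.

def fap_pair(question, sampled_solutions, gold_answer):
    """Final Answer Preference (FAP): choose by outcome correctness.

    Picks one solution whose final answer matches the gold answer as
    'chosen' and one whose answer does not as 'rejected'."""
    chosen = next(s for s in sampled_solutions if s["answer"] == gold_answer)
    rejected = next(s for s in sampled_solutions if s["answer"] != gold_answer)
    return {"prompt": question, "chosen": chosen["text"], "rejected": rejected["text"]}

def scp_pair(question, faulty_chain, first_error_idx, corrected_step):
    """Stepwise Corrective Preference (SCP): localize the first error.

    The shared correct prefix plus a model-generated correction is the
    'winner'; the prefix plus the original erroneous step is the 'loser'."""
    prefix = faulty_chain[:first_error_idx]
    chosen = "\n".join(prefix + [corrected_step])
    rejected = "\n".join(faulty_chain[:first_error_idx + 1])
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

The key design difference: FAP only needs a verifiable final answer (an outcome reward), while SCP additionally needs a step-level error locator and a corrector (e.g. a GenRM), which yields pairs that differ only at the erroneous step.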