CPT: Continual Pre-training—further training a base model on domain-specific raw text
IT: Instruction Tuning (or SFT)—training on (instruction, response) pairs to learn task execution
PA: Preference Alignment—optimizing the model toward outputs that humans or automated judges prefer, typically via RLHF or DPO
DPO: Direct Preference Optimization—a stable method for aligning a model to preferences directly from preference pairs, without explicitly training a separate reward model
GenRM: Generative Reward Model—an LLM prompted to evaluate or correct responses rather than outputting a scalar score
SCP: Stepwise Corrective Preference—constructing preference data by finding the first error in a reasoning chain and using a model-generated correction as the 'winner'
FAP: Final Answer Preference—constructing preference data based on the correctness of the final answer (outcome reward)
FinCap: The set of four identified core capabilities: Domain Concepts, Domain Tasks, Reasoning, and Instruction Following
CFA: Chartered Financial Analyst—a rigorous professional certification in finance, used here as a source of complex reasoning tasks
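For readers who want the objective behind the DPO entry above, the standard DPO loss (notation follows the original DPO paper; this document may use different symbols) is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred ("winner") and dispreferred ("loser") responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ controls the strength of the implicit KL constraint.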
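The SCP and FAP entries describe two ways of constructing (chosen, rejected) preference pairs. A minimal sketch of the contrast is below; the function names, field names, and inputs are illustrative assumptions, not this work's actual pipeline.

```python
# Hedged sketch: two ways to build DPO-style preference pairs.
# All names (fap_pair, scp_pair, dict keys) are illustrative placeholders.

def fap_pair(question, sampled_solutions, gold_answer):
    """Final Answer Preference (FAP): choose by outcome correctness.

    Picks one solution whose final answer matches the gold answer as
    'chosen' and one whose answer does not as 'rejected'."""
    chosen = next(s for s in sampled_solutions if s["answer"] == gold_answer)
    rejected = next(s for s in sampled_solutions if s["answer"] != gold_answer)
    return {"prompt": question, "chosen": chosen["text"], "rejected": rejected["text"]}

def scp_pair(question, faulty_chain, first_error_idx, corrected_step):
    """Stepwise Corrective Preference (SCP): localize the first error.

    The shared correct prefix plus a model-generated correction is the
    'winner'; the prefix plus the original erroneous step is the 'loser'."""
    prefix = faulty_chain[:first_error_idx]
    chosen = "\n".join(prefix + [corrected_step])
    rejected = "\n".join(faulty_chain[:first_error_idx + 1])
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

The key design difference: FAP only needs a verifiable final answer (an outcome reward), while SCP additionally needs a step-level error locator and a corrector (e.g. a GenRM), which yields pairs that differ only at the erroneous step.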