
Fine-tuning Smaller Language Models for Question Answering over Financial Documents

KS Phogat, SA Puranam, S Dasaratha, C Harsha…
TCS Research
arXiv, August 2024
Reasoning QA

📝 Paper Summary

Financial Question Answering · Numerical Reasoning in LLMs · Knowledge Distillation
Small language models fine-tuned on financial reasoning programs generated by GPT-4 can achieve performance comparable to the teacher model by improving concept understanding and entity extraction.
Core Problem
Financial question answering requires complex numerical reasoning and domain knowledge, typically necessitating very large, computationally expensive models like GPT-4.
Why it matters:
  • Deploying massive models (hundreds of billions of parameters) is costly and computationally inefficient for scale
  • Generic reasoning capabilities in small models often fail on domain-specific financial nuances and structured data formats
  • Previous methods for inducing reasoning in small models haven't fully explored the specific requirements of the financial domain
Concrete Example: When asked a financial question, a base Orca-2-7B model often fails to produce executable code, or writes a descriptive formula without an actual mathematical expression. In contrast, the fine-tuned version correctly identifies the formula, extracts the entities, and executes the calculation.
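A program-of-thought output of the kind described above might look like the following sketch. This is a hypothetical illustration: the question, figures, and variable names are invented for clarity, not taken from the paper.

```python
# Hypothetical PoT-style program for a financial question such as:
# "Revenue rose from $120M in 2021 to $150M in 2022. What was the
# percentage change?" Entities are extracted first, then the concept's
# formula is encoded explicitly as executable arithmetic.

def solve():
    # Step 1: entity extraction from the document/table
    revenue_2021 = 120.0  # in $ millions
    revenue_2022 = 150.0  # in $ millions

    # Step 2: concept/formula -- percentage change
    pct_change = (revenue_2022 - revenue_2021) / revenue_2021 * 100

    return pct_change

answer = solve()
print(round(answer, 2))  # prints 25.0
```

Because the reasoning is expressed as code, correctness can be checked mechanically by running the program, which is what makes the filtering step in the pipeline possible.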
Key Novelty
Financial Reasoning Distillation via Program of Thought (PoT)
  • Uses GPT-4 (teacher) to generate Python programs that explicitly encode financial reasoning steps (concept, formula, entities) via few-shot PoT prompting
  • Filters training data by executing teacher code to ensure correctness, then fine-tunes small student models (Phi-3, Mistral, Orca-2) on these verifiable reasoning traces
  • Proposes a novel evaluation method using GPT-4 to grade the 'concept understanding' of student models on a 5-point scale
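The execution-based filtering in the second bullet can be sketched as follows. This is a minimal illustration of the assumed mechanics, not the authors' code; the sample data and helper names are invented.

```python
# Sketch of execution-based data curation: run each teacher-generated
# program in an isolated namespace and keep only samples whose executed
# result matches the gold answer within a small tolerance.

def execute_program(code: str):
    """Run generated code; return its `answer` variable, or None on failure."""
    namespace = {}
    try:
        exec(code, namespace)
        return namespace.get("answer")
    except Exception:
        return None  # non-executable programs are filtered out

def curate(samples, tol=1e-4):
    """Keep (question, code) pairs whose executed answer matches the gold answer."""
    kept = []
    for question, code, gold in samples:
        result = execute_program(code)
        if isinstance(result, (int, float)) and abs(result - gold) <= tol:
            kept.append((question, code))
    return kept

samples = [
    ("pct change", "answer = (150 - 120) / 120 * 100", 25.0),  # correct, executable
    ("broken",     "answer = undefined_var + 1",        10.0),  # raises NameError
]
print(len(curate(samples)))  # prints 1 -- only the verified sample survives
```

Filtering on execution correctness gives the student models a training set of verifiable reasoning traces rather than unchecked teacher outputs.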
Architecture
Figure 1: The three-step fine-tuning pipeline — Code Generation, Data Curation, and Fine-tuning
Evaluation Highlights
  • Fine-tuned phi-3-medium achieves accuracy within 1% of the teacher model (GPT-4) on the FinQA dataset
  • After fine-tuning, the small models outperform GPT-3.5 Turbo by 4-10% on FinQA
  • Fine-tuning with just 1,500 samples (vs full dataset) yields performance within 3-8% of the fully trained model, showing high data efficiency
Breakthrough Assessment
7/10
Strong empirical evidence that SLMs can rival GPT-4 in niche domains via code-based reasoning. The proposed evaluation for 'concept accuracy' using LLMs is a useful methodological contribution.