
Fine-tuning Smaller Language Models for Question Answering over Financial Documents

KS Phogat, SA Puranam, S Dasaratha, C Harsha…
TCS Research
arXiv, August 2024
Reasoning QA

📝 Paper Summary

Financial Question Answering · Numerical Reasoning in LLMs · Knowledge Distillation
Small language models fine-tuned on financial reasoning programs generated by GPT-4 can achieve performance comparable to the teacher model by improving concept understanding and entity extraction.
Core Problem
Financial question answering requires complex numerical reasoning and domain knowledge, typically necessitating very large, computationally expensive models like GPT-4.
Why it matters:
  • Deploying massive models (hundreds of billions of parameters) is costly and computationally inefficient for scale
  • Generic reasoning capabilities in small models often fail on domain-specific financial nuances and structured data formats
  • Previous methods for inducing reasoning in small models haven't fully explored the specific requirements of the financial domain
Concrete Example: When asked a financial question, a base Orca-2-7B model often fails to produce executable code, or writes a descriptive formula without an actual mathematical expression. In contrast, the fine-tuned version correctly identifies the formula, extracts the entities, and executes the calculation.
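A program-of-thought output of the kind described above might look like the following sketch. This is a hypothetical illustration: the question, figures, and variable names are invented for clarity, not taken from the paper.

```python
# Hypothetical PoT-style program for a financial question such as:
# "Revenue rose from $120M in 2021 to $150M in 2022. What was the
# percentage change?" Entities are extracted first, then the concept's
# formula is encoded explicitly as executable arithmetic.

def solve():
    # Step 1: entity extraction from the document/table
    revenue_2021 = 120.0  # in $ millions
    revenue_2022 = 150.0  # in $ millions

    # Step 2: concept/formula -- percentage change
    pct_change = (revenue_2022 - revenue_2021) / revenue_2021 * 100

    return pct_change

answer = solve()
print(round(answer, 2))  # prints 25.0
```

Because the reasoning is expressed as code, correctness can be checked mechanically by running the program, which is what makes the filtering step in the pipeline possible.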
Key Novelty
Financial Reasoning Distillation via Program of Thought (PoT)
  • Uses GPT-4 (teacher) to generate Python programs that explicitly encode financial reasoning steps (concept, formula, entities) via few-shot PoT prompting
  • Filters training data by executing teacher code to ensure correctness, then fine-tunes small student models (Phi-3, Mistral, Orca-2) on these verifiable reasoning traces
  • Proposes a novel evaluation method using GPT-4 to grade the 'concept understanding' of student models on a 5-point scale
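The execution-based filtering in the second bullet can be sketched as follows. This is a minimal illustration of the assumed mechanics, not the authors' code; the sample data and helper names are invented.

```python
# Sketch of execution-based data curation: run each teacher-generated
# program in an isolated namespace and keep only samples whose executed
# result matches the gold answer within a small tolerance.

def execute_program(code: str):
    """Run generated code; return its `answer` variable, or None on failure."""
    namespace = {}
    try:
        exec(code, namespace)
        return namespace.get("answer")
    except Exception:
        return None  # non-executable programs are filtered out

def curate(samples, tol=1e-4):
    """Keep (question, code) pairs whose executed answer matches the gold answer."""
    kept = []
    for question, code, gold in samples:
        result = execute_program(code)
        if isinstance(result, (int, float)) and abs(result - gold) <= tol:
            kept.append((question, code))
    return kept

samples = [
    ("pct change", "answer = (150 - 120) / 120 * 100", 25.0),  # correct, executable
    ("broken",     "answer = undefined_var + 1",        10.0),  # raises NameError
]
print(len(curate(samples)))  # prints 1 -- only the verified sample survives
```

Filtering on execution correctness gives the student models a training set of verifiable reasoning traces rather than unchecked teacher outputs.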
Architecture
Figure 1: The three-step fine-tuning pipeline — Code Generation, Data Curation, and Fine-tuning
Evaluation Highlights
  • Fine-tuned phi-3-medium achieves accuracy within 1% of the teacher model (GPT-4) on the FinQA dataset
  • After fine-tuning, the small models outperform GPT-3.5 Turbo by 4-10% on FinQA
  • Fine-tuning with just 1,500 samples (vs full dataset) yields performance within 3-8% of the fully trained model, showing high data efficiency
Breakthrough Assessment
7/10
Strong empirical evidence that SLMs can rival GPT-4 in niche domains via code-based reasoning. The proposed evaluation for 'concept accuracy' using LLMs is a useful methodological contribution.