Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination

📝 Paper Summary

Hallucination suppression
Factuality in specific domains

The paper empirically quantifies severe hallucinations in LLMs on financial tasks and demonstrates that prompt-based tool learning and RAG are more effective mitigations than few-shot prompting or decoding strategies.

Core Problem

General-purpose LLMs frequently generate unsupported or factually incorrect content (hallucinations) in financial tasks, which carries high risks like monetary loss and erosion of trust.

Why it matters:

Finance requires pinpoint accuracy; inaccuracies in stock prices or terminology can lead to severe real-world consequences like financial loss
There is a lack of empirical investigation into how often and to what extent LLMs hallucinate specifically within the intricate financial domain
Standard mitigation methods like few-shot learning may improve format following but fail to correct fundamental factual errors in domain-specific tasks

Concrete Example: When asked for the stock symbol of 'Perfumania Holdings', GPT-4 incorrectly provides 'PERF', failing to account for its delisting. Additionally, Llama-2-7B predicts historical stock prices with a mean absolute error of over $6000 in zero-shot settings.

Key Novelty

Empirical Benchmark for Financial Hallucinations

Establishes a three-task benchmark (acronym recognition, term explanation, stock price query) to quantify hallucinations in finance
Evaluates the comparative efficacy of four mitigation strategies: few-shot prompting, Decoding by Contrasting Layers (DoLa), RAG, and prompt-based tool learning
Demonstrates that domain-specific fine-tuning (FinMA) can paradoxically reduce general instruction-following ability, producing more hallucinations than the base model

Architecture

Overview of the empirical examination framework, showing input questions, the hallucination problem (e.g., GPT-4 confusing 'TIF' definitions), and the mitigation methods (RAG, Tools).

Evaluation Highlights

Prompt-based tool learning achieves 100% accuracy on stock price queries for Llama-2 models with just one training example, compared to 0% accuracy without tools
RAG significantly improves FactScore on financial term explanations, raising Llama-2-7B-chat performance from 38.3% to 62.5%
GPT-4 achieves 90.4% accuracy on stock symbol recognition but still hallucinates outdated information (e.g., delisted stocks)

Breakthrough Assessment

7/10

Provides a necessary empirical reality check for FinLLMs, highlighting that standard fine-tuning isn't a silver bullet and establishing strong baselines for tool-augmented mitigation.

⚙️ Technical Details

Problem Definition

Setting: Evaluating LLM generation against ground truth in three financial tasks: abbreviation recognition, terminology explanation, and historical data querying

Inputs: Natural language questions asking for financial acronym expansions, term definitions, or historical stock prices

Outputs: Textual responses (acronyms/definitions) or specific numerical values (stock prices)

Pipeline Flow

Input Query (Financial Question)
Mitigation Strategy Application (None / Few-shot / DoLa / RAG / Tool)
Model Generation (Llama-2 / GPT)
Evaluation (Accuracy / FactScore / MAE)

System Modules

Input Processor

Formats the user query into specific prompts (zero-shot or few-shot)

Model or implementation: N/A
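
The zero-shot versus few-shot formatting step can be illustrated with a small prompt builder. This is a minimal sketch, not the paper's actual template: the `Question:`/`Answer:` layout, the function name `build_prompt`, and the demonstration pair are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def build_prompt(question: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Return a zero-shot prompt when `examples` is empty, else a few-shot prompt."""
    # Each demonstration is rendered as a solved question/answer pair.
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The target question is left unanswered so the model completes it.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical demonstration pair, for illustration only.
DEMOS = [("What does the financial acronym 'IPO' stand for?", "Initial Public Offering")]

QUESTION = "What does the financial acronym 'ETF' stand for?"
zero_shot = build_prompt(QUESTION)
few_shot = build_prompt(QUESTION, DEMOS)
```

The only difference between the two settings is the list of solved examples prepended to the query, which is consistent with the observation that few-shot prompting mostly helps the model follow the answer format rather than supply missing facts.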

Retrieval Module (RAG only) (Mitigation)

Retrieves relevant context from Wikipedia using FAISS vector store

Model or implementation: FAISS vector store
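
The retrieve-then-prompt step can be sketched without any external dependency. The example below is an assumption-laden stand-in: it replaces the FAISS vector store and dense embeddings with a toy bag-of-words cosine search over two hand-written passages, and the names `embed`, `retrieve`, and `build_rag_prompt` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: List[str], k: int = 1) -> List[str]:
    """Brute-force top-k search; a vector store such as FAISS plays this role at scale."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, passages: List[str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question before generation."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hand-written stand-ins for Wikipedia passages (illustrative only).
PASSAGES = [
    "An exchange-traded fund is an investment fund that is traded on stock exchanges.",
    "A bond is a debt security issued by a borrower to lenders.",
]

top = retrieve("What is an exchange-traded fund?", PASSAGES)
prompt = build_rag_prompt("What is an exchange-traded fund?", PASSAGES)
```

The generator then answers from the supplied context rather than from parametric memory alone, which is the mechanism RAG relies on to improve factuality.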

Tool Generator (Tool Learning only) (Mitigation)

Generates a Python function call for Alpha Vantage API

Model or implementation: LLM (Llama-2 or GPT)
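
One way to execute a model-generated function call safely is to parse it instead of evaluating it. The sketch below is illustrative only: `get_close_price`, the in-memory price table, and the ticker `XYZ` are hypothetical placeholders, and no real Alpha Vantage request is made.

```python
import ast
from typing import Any, Callable, Dict

# Placeholder price table; in the paper's setup the tool queries the
# Alpha Vantage API instead. The ticker and value here are made up.
_FAKE_PRICES = {("XYZ", "2020-01-02"): 12.34}


def get_close_price(symbol: str, date: str) -> float:
    """Hypothetical tool: look up the closing price for `symbol` on `date`."""
    return _FAKE_PRICES[(symbol, date)]


# Whitelist of tools the model is allowed to call.
TOOLS: Dict[str, Callable[..., Any]] = {"get_close_price": get_close_price}


def run_tool_call(generated: str) -> Any:
    """Parse a single `name(arg, ...)` call emitted by the model and run it.

    Only whitelisted tool names and literal arguments are accepted, so the
    model's text is never passed to eval or exec.
    """
    node = ast.parse(generated.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a single function call")
    if node.func.id not in TOOLS:
        raise ValueError(f"unknown tool: {node.func.id}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("**kwargs expansion is not allowed")
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return TOOLS[node.func.id](*args, **kwargs)


# Text a model might emit after seeing one demonstration in the prompt.
model_output = 'get_close_price("XYZ", "2020-01-02")'
price = run_tool_call(model_output)
```

Because the number comes from the tool rather than from the model's memory, the answer is only as wrong as the data source, which is why this route can reach exact-match accuracy where free generation cannot.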

Generator

Produces final answer or stock price

Model or implementation: Llama-2-7B, Llama-2-7B-chat, GPT-3.5-turbo, GPT-4, FinMA-7B

Novel Architectural Elements

Application of prompt-based tool learning to retrieving historical financial data, so that stock price answers come from an external source rather than the model's memory

Modeling

Base Model: Llama-2-7B and Llama-2-7B-chat (HuggingFace weights)

Comparison to Prior Work

vs. FinMA-7B: This paper shows FinMA-7B underperforms its base LLaMA-1 model in instruction following, suggesting multi-task fine-tuning can degrade general abilities
vs. General RAG [not cited in paper]: This paper evaluates RAG specifically for the correctness of financial terminology, using FactScore rather than general QA metrics
vs. DoLa: Shows DoLa is limited when the model fundamentally lacks the specific financial facts (e.g., specific stock prices) in pre-training data

Limitations

Evaluation is limited to three specific tasks; does not cover complex financial reasoning or report generation
Reliability of DoLa is constrained by the underlying model's pre-training knowledge
GPT-3.5/GPT-4 evaluation depends on API availability and incurs usage costs
Identifying 'outdated' information (like delisted stocks) requires manual verification against a ground-truth snapshot, which itself goes stale as listings change over time

Reproducibility

Code: https://github.com/mk322/fin_hallu

📊 Experiments & Results

Evaluation Setup

Empirical evaluation on 3 tasks: Acronym Recognition, Term Explanation, Stock Price Query

Benchmarks:

Financial Acronym Recognition (Knowledge Probing / QA) [New]
Financial Term Explanation (Long-form generation) [New]
Stock Price Query (Numerical data retrieval) [New]

Metrics:

Accuracy (Exact match / Substring match)
FactScore (atomic fact verification)
Mean Absolute Error (MAE) for prices
Statistical methodology: Not explicitly reported in the paper
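
The two simpler metrics can be written in a few lines. This is a sketch under stated assumptions: the case-insensitive substring rule is one plausible reading of "Substring match", not a confirmed detail of the paper's scoring script, and the sample inputs are invented.

```python
from typing import Sequence


def substring_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that contain the reference answer.

    Case-insensitive matching is an assumption made for this sketch.
    """
    hits = sum(ref.lower() in pred.lower() for pred, ref in zip(predictions, references))
    return hits / len(references)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true prices."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Invented examples: the first answer contains the reference, the second does not.
acc = substring_accuracy(["The ticker is ABC.", "I think it is QRS"], ["ABC", "XYZ"])
# Errors of 2.0 and 3.0 average to 2.5.
mae = mean_absolute_error([10.0, 20.0], [12.0, 17.0])
```

MAE is reported in the same unit as the prices (USD), so a value in the thousands means the typical guess is off by thousands of dollars per query.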

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Financial Term Explanations (Task II) showing the impact of mitigation strategies.
Financial Term Explanations	FactScore	38.30	62.50	+24.20
Financial Term Explanations	FactScore	38.30	42.10	+3.80
Performance on Stock Price Query (Task III) highlighting the necessity of external tools.
Stock Price Query	Accuracy	0.00	100.00	+100.00
Stock Price Query	MAE (USD)	6380.5	6380.5	0.0
General knowledge probing results (Task I) comparing models.
Acronym Recognition	Accuracy	38.0	82.5	+44.5

Main Takeaways

Off-the-shelf LLMs exhibit serious hallucination in finance; Llama-2 models have near-zero capability for accurate historical stock prices without tools
Domain-specific fine-tuning (FinMA) can degrade performance compared to base models, likely due to 'catastrophic forgetting' of instruction-following abilities
RAG consistently outperforms DoLa and Few-shot prompting for improving factuality in explanations
Prompt-based tool learning is the only evaluated strategy that works for precise numerical tasks like stock price retrieval, achieving 100% accuracy where pure generation fails completely

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of financial terminology and stock market data

Key Terms

RAG: Retrieval-Augmented Generation—enhancing model output by retrieving relevant documents (e.g., from Wikipedia) before generation

DoLa: Decoding by Contrasting Layers—a decoding strategy that contrasts outputs from different model layers to amplify factual knowledge and reduce hallucinations
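
A single DoLa-style decoding step can be sketched as follows. The assumptions are significant: the logits are invented, the earlier layer is fixed by hand (the published method picks it dynamically per step), and `contrast_layers` and `alpha` are names chosen for this example.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_layers(final_logits: List[float], early_logits: List[float],
                    alpha: float = 0.1) -> List[float]:
    """Score tokens by log p_final - log p_early, restricted to plausible tokens.

    Tokens whose final-layer probability is below alpha times the maximum
    final-layer probability are masked out with -inf.
    """
    lp_final = log_softmax(final_logits)
    lp_early = log_softmax(early_logits)
    cutoff = max(lp_final) + math.log(alpha)
    return [f - e if f >= cutoff else float("-inf")
            for f, e in zip(lp_final, lp_early)]


# Invented logits over a three-token vocabulary.
final_logits = [2.0, 1.9, -3.0]   # final layer: tokens 0 and 1 are nearly tied
early_logits = [2.0, 0.5, -3.0]   # early layer: already confident in token 0
scores = contrast_layers(final_logits, early_logits)
next_token = max(range(len(scores)), key=scores.__getitem__)
greedy_token = max(range(len(final_logits)), key=final_logits.__getitem__)
```

Here plain greedy decoding on the final layer picks token 0, while the contrast favors token 1, whose probability grew between the early and final layers. The sketch also makes the limitation visible: if no layer assigns meaningful probability to the correct fact, there is nothing for the contrast to amplify.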

FactScore: A metric that breaks a generated response into atomic facts and reports the fraction of them supported by a reference source (e.g., Wikipedia)
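
The final aggregation step of such a score is a simple ratio. The sketch below assumes the response has already been split into atomic facts and that some judge decides support; the exact-match lookup against a tiny `KNOWLEDGE` set is a hypothetical stand-in for that judge, and returning 0.0 for an empty response is an assumption.

```python
from typing import Callable, Iterable


def fact_score(atomic_facts: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged as supported by the reference source."""
    facts = list(atomic_facts)
    if not facts:
        return 0.0  # assumption: an empty response earns no credit
    return sum(1 for f in facts if is_supported(f)) / len(facts)


# Toy reference knowledge and invented atomic facts, for illustration only.
KNOWLEDGE = {
    "an etf is an investment fund",
    "an etf trades on stock exchanges",
}
facts = [
    "An ETF is an investment fund",
    "An ETF trades on stock exchanges",
    "An ETF guarantees returns",
    "An ETF is a type of bond",
]
score = fact_score(facts, lambda f: f.lower() in KNOWLEDGE)
```

Two of the four facts are supported, so the score is 0.5. A real pipeline would replace both the fact splitter and the judge with model-based components backed by retrieved reference text.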

FinMA: A multi-task fine-tuned version of LLaMA-1 specialized for financial tasks

Prompt-based tool learning: A method where the LLM is prompted to generate code/function calls (e.g., Python API wrappers) to fetch external data instead of relying on internal memory

MAE: Mean Absolute Error—measure of the average size of mistakes in a collection of predictions, without considering their direction

Greedy decoding: A decoding method where the model selects the most probable next token at each step
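
Greedy decoding can be illustrated with a toy next-token table. Everything here is hypothetical: the vocabulary, the probabilities, and the simplification that the next token depends only on the previous one.

```python
from typing import Dict, List

# Toy next-token distributions keyed by the previous token (made-up values).
NEXT: Dict[str, Dict[str, float]] = {
    "<s>": {"stock": 0.6, "bond": 0.4},
    "stock": {"price": 0.7, "split": 0.3},
    "price": {"</s>": 0.9, "target": 0.1},
}


def greedy_decode(start: str = "<s>", max_steps: int = 10) -> List[str]:
    """Repeatedly pick the single most probable next token until end-of-sequence."""
    tokens: List[str] = []
    current = start
    for _ in range(max_steps):
        dist = NEXT.get(current)
        if not dist:
            break  # no known continuation for this token
        current = max(dist, key=dist.get)
        if current == "</s>":
            break
        tokens.append(current)
    return tokens


decoded = greedy_decode()
```

The procedure is deterministic, so a model that assigns its highest probability to a wrong fact will state that fact every time.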

Few-shot prompting: Providing the model with a small number of example input-output pairs in the prompt to guide its generation