FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

📝 Paper Summary

Financial Natural Language Processing (FinNLP) Prompt Engineering Chain-of-Thought Reasoning

FinCoT enhances Large Language Model performance on financial tasks by injecting expert reasoning blueprints (encoded as Mermaid diagrams) into a structured Chain-of-Thought prompt without requiring model fine-tuning.

Core Problem

General-purpose Chain-of-Thought prompting lacks domain-specific constraints, leading LLMs to omit critical financial checks (e.g., valuation, unit conversion) or use incorrect formulas.

Why it matters:

Financial decision-making requires precise mathematics and adherence to standard workflows (e.g., discounting, portfolio attribution) that generic models often miss.
Existing solutions rely on fine-tuning or few-shot exemplars, which demand labeled data and lack explicit control over the intermediate reasoning structure.
Lack of interpretability and alignment with expert practice in current financial LLMs hinders their adoption in regulated high-stakes environments.

Concrete Example: In finance, a model might confuse basis points with percentages or skip a valuation check. FinCoT prevents this by forcing the model to follow a blueprint that explicitly lists 'Check Units' or 'Apply Discount Formula' as mandatory steps.

Key Novelty

FinCoT (Financial Chain-of-Thought)

Embeds domain-specific 'expert blueprints' (visualized as Mermaid diagrams) directly into the prompt as a hint, guiding the model through a standard professional workflow.
Uses a structured tag-based format (<thinking>, <output>) combined with a 'semi-reflection' step where the model verifies its own reasoning before finalizing the answer.

Architecture

The FinCoT prompting framework structure compared to standard inputs.

Evaluation Highlights

Boosts Qwen3-8B-Base accuracy from 63.2% (Standard Prompting) to 80.5% (+17.3pp) on CFA-style questions.
Improves Fin-R1 (7B) accuracy from 65.7% to 75.7% (+10.0pp), showing benefits even for domain-specific models.
Reduces output token length by ~8x compared to unstructured Chain-of-Thought while improving reasoning clarity.

Breakthrough Assessment

7/10

Significant accuracy gains and efficiency improvements without fine-tuning. The use of Mermaid diagrams as prompt constraints is a clever, interpretable innovation, though tested primarily on CFA-style multiple-choice questions.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Question Answering on financial domain tasks (CFA-style)

Inputs: Financial question q and a domain-specific expert blueprint B

Outputs: Structured reasoning trace r and final answer a

Pipeline Flow

Domain Classification (External Step) -> FinCoT Prompt Construction -> LLM Inference

System Modules

Prompt Constructor

Assembles the prompt with System instructions, the User Question, and the Domain-Specific Mermaid Blueprint

Model or implementation: N/A (Prompt Template)

Reasoning Engine

Generates the response following the structured tags

Model or implementation: Qwen3-8B-Base / Fin-R1 / etc.

Novel Architectural Elements

Injection of Mermaid syntax blueprints as 'hints' within a zero-shot prompt to constrain reasoning flow
Integration of a 'semi-reflection' step inside the final output tag rather than as a separate conversational turn

Modeling

Base Model: Qwen3-8B-Base (primary evaluation target)

Comparison to Prior Work

vs. ST-CoT: Adds domain-specific Mermaid diagrams as constraints.
vs. Plan-and-Solve: FinCoT provides the 'plan' (blueprint) externally based on expert knowledge rather than asking the model to generate it [not cited in paper].
vs. Fine-tuned models (Fin-R1): FinCoT is a prompting strategy that can be applied to any model, improving even fine-tuned ones without training.

Limitations

Depends on accurate external classification of the question's domain to select the correct blueprint.
Less effective on models that are already heavily instruction-tuned (e.g., Gemma-3-12B-IT showed smaller gains).
Performance gains are lower in non-quantitative domains like Ethics.
Mermaid blueprints are static and handcrafted; they may not cover every edge case within a domain.

Reproducibility

Blueprints are described in Appendix A. The dataset used is the CFA-Easy subset of FinEval (Flare-CFA). Prompt templates are provided in Appendix B. Code URL is not provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Zero-shot Multiple Choice Question Answering

Benchmarks:

Flare-CFA (CFA-Easy subset of FinEval) (Financial QA)

Metrics:

Accuracy (%)
Average Output Length (tokens)
Statistical methodology: Paired bootstrap test with B=10k resamples; p-values reported.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FinCoT significantly outperforms Standard Prompting (SP), Unstructured CoT (UST-CoT), and Structured CoT (ST-CoT) on the base model Qwen3-8B-Base.
Flare-CFA	Accuracy	63.2	80.5	+17.3
Flare-CFA	Accuracy	73.3	80.5	+7.2
FinCoT improves performance even on specialized financial models.
Flare-CFA	Accuracy	65.7	75.7	+10.0
FinCoT drastically reduces token usage compared to Unstructured CoT.
Flare-CFA	Average Output Length	1369.4	154.5	-1214.9

Experiment Figures

Comparison of reasoning traces between Standard Prompting, Unstructured CoT, Structured CoT, and FinCoT.

Main Takeaways

FinCoT is most effective on base/pretrained models (e.g., Qwen3-8B-Base) compared to instruction-tuned models, suggesting it provides the structure that raw models lack.
The method consistently reduces verbosity (~8x fewer tokens than UST-CoT), making inference faster and cheaper while improving accuracy.
Improvements are highest in quantitative domains (e.g., Derivatives, Quantitative Methods) where structured steps are crucial, and lower in qualitative domains like Ethics.
Structured CoT (ST-CoT) alone improves over Standard Prompting, but adding the Expert Blueprints (FinCoT) yields further significant gains.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Zero-shot prompting
Basic financial concepts (CFA curriculum)
Mermaid diagram syntax

Key Terms

FinCoT: Financial Chain-of-Thought—a prompting strategy that embeds expert workflows as diagrams to guide LLM reasoning.

Mermaid: A text-based syntax for generating diagrams and charts, used here to encode reasoning blueprints that LLMs can parse.

ST-CoT: Structured Chain-of-Thought—prompting that enforces specific tags like <thinking> and <output> to organize reasoning.

UST-CoT: Unstructured Chain-of-Thought—standard free-form step-by-step reasoning prompting.

SP: Standard Prompting—zero-shot prompting where the model is asked to answer directly without explicit reasoning steps.

CFA: Chartered Financial Analyst—a professional designation; used here to refer to the difficulty and style of the evaluation questions.

FinNLP: Financial Natural Language Processing—applying NLP techniques specifically to the finance domain.

SFT: Supervised Fine-Tuning—training a model on a labeled dataset to adapt it to a specific task.

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used for training some of the baseline models (e.g., DianJin-R1, Fin-o1).

semi-reflection: A simplified self-verification step included within the <output> block of FinCoT, avoiding a separate complex reflection phase.