Financial Large Language Models · Reasoning Models · Reinforcement Learning for LLMs
Fin-R1 is a 7-billion parameter financial LLM trained via supervised fine-tuning and Group Relative Policy Optimization on a distilled high-quality reasoning dataset, achieving strong performance with low deployment costs.
Core Problem
General-purpose reasoning models struggle in finance due to fragmented data sources, black-box opacity that conflicts with compliance requirements for transparency, and weak transferability to specific business scenarios.
Why it matters:
Financial tasks require integrating heterogeneous knowledge (legal, economic, quantitative) which general models often fail to do coherently
Regulatory environments demand traceability and explainability, but most models output answers without transparent reasoning paths
Existing financial models typically rely on pre-training or simple fine-tuning, lacking the deep reasoning capabilities needed for complex tasks like risk pricing
Concrete Example: Financial data often contains contradictory signals across dispersed sources (e.g., contractual terms vs. market signals). A standard model might memorize training examples, failing to generalize when these signals conflict in new scenarios, whereas a reasoning model needs to explicitly weigh the evidence.
Key Novelty
Two-stage Financial Reasoning Post-Training
Constructs a specialized reasoning dataset (Fin-R1-Data) by distilling reasoning traces from DeepSeek-R1 and filtering them with Qwen2.5-72B-Instruct
Applies Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to enforce both correctness and structured, interpretable reasoning chains in a small 7B model
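The GRPO stage optimizes for both correctness and structured output. Following the DeepSeek-R1-style recipe the paper builds on, this can be sketched as a dual reward: one term for an explicit think/answer format and one for answer accuracy. Tag names, weights, and function names here are illustrative assumptions, not the paper's actual code:

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output wraps reasoning and answer in explicit tags."""
    pattern = r"<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.search(pattern, output, re.DOTALL) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly."""
    m = re.search(r"<answer>(.+?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def total_reward(output: str, reference: str) -> float:
    # Equal weighting is an assumption; the paper may weight terms differently.
    return format_reward(output) + accuracy_reward(output, reference)
```

Because both terms are rule-based, no learned reward model is needed, which keeps the RL stage cheap for a 7B model.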
Architecture
The overall two-stage framework pipeline: Data Construction (Distillation & Filtering) and Model Training (SFT + RL).
Evaluation Highlights
Achieved an average score of 75.2 on established financial reasoning benchmarks
Outperformed existing state-of-the-art models of the same 7B scale by more than 17 points
Ranked second overall among tested models, surpassing many larger general-purpose models
Breakthrough Assessment
8/10
Successfully adapts the 'reasoning model' paradigm (like o1/DeepSeek-R1) to a specific vertical (finance) using a 7B model, demonstrating that smaller domain-specific models can achieve high reasoning performance via specialized RL.
⚙️ Technical Details
Problem Definition
Setting: Financial Question Answering and Reasoning
Code is publicly available at https://github.com/SUFE-AIFLM-Lab/Fin-R1. The Fin-R1-Data construction process is detailed as a three-step pipeline (Source -> Distillation -> Filtering). The FinPEE dataset is constructed from proprietary exam questions.
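The Source -> Distillation -> Filtering flow can be sketched as follows. This is a minimal illustration assuming a teacher model (DeepSeek-R1) that emits reasoning traces and a judge model (Qwen2.5-72B-Instruct) that verifies them; all function names and prompt wording are hypothetical, not the paper's implementation:

```python
from typing import Callable

def distill(teacher: Callable[[str], str], question: str) -> str:
    """Ask the teacher model for a step-by-step reasoning trace."""
    return teacher(f"Answer step by step:\n{question}")

def judge_accepts(judge: Callable[[str], str], question: str,
                  trace: str, reference: str) -> bool:
    """LLM-as-a-judge filter: keep only traces whose answer matches."""
    verdict = judge(
        f"Question: {question}\nTrace: {trace}\n"
        f"Reference answer: {reference}\nIs the trace correct? yes/no"
    )
    return verdict.strip().lower().startswith("yes")

def build_dataset(teacher, judge, items):
    """items: (question, reference_answer) pairs from source benchmarks."""
    kept = []
    for question, reference in items:
        trace = distill(teacher, question)
        if judge_accepts(judge, question, trace, reference):
            kept.append({"question": question, "trace": trace})
    return kept
```

The key design choice this mirrors is using a strong judge to discard distilled traces whose reasoning ends in a wrong answer, so the SFT stage only sees verified chains.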
📊 Experiments & Results
Evaluation Setup
Evaluation on established financial benchmarks and practical application scenarios
Benchmarks:
FinQA (Financial Question Answering with numerical reasoning)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Financial Reasoning Benchmarks (Average) | Average Score | Not reported in the paper | 75.2 | Not reported in the paper |
Experiment Figures
Sample outputs of Fin-R1 in Chinese and English, demonstrating explicit reasoning paths.
Main Takeaways
Fin-R1 achieves an average score of 75.2 on financial reasoning benchmarks, ranking second overall.
Outperforms peer 7B models by a margin of over 17 points, demonstrating the efficacy of the RL-based reasoning pipeline.
The two-stage training (SFT + GRPO) effectively enables a small model to perform complex reasoning tasks previously reserved for much larger models.
Data quality is critical: filtering distilled data with a strong judge (Qwen2.5-72B) ensures high-quality reasoning traces for training.
📚 Prerequisite Knowledge
Prerequisites
Basics of Large Language Model training (SFT, RLHF)
Understanding of Chain-of-Thought (CoT) reasoning
Familiarity with Reinforcement Learning algorithms (PPO, GRPO)
Key Terms
GRPO: Group Relative Policy Optimization—an RL algorithm that improves reasoning by generating multiple outputs for a prompt and optimizing based on their relative performance within the group, avoiding the need for a separate value network
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it specific behaviors or knowledge
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm for LLMs that constrains policy updates to ensure stability
Distillation: The process of training a smaller 'student' model to mimic the outputs or reasoning of a larger, more capable 'teacher' model
LLM-as-a-Judge: Using a strong LLM to evaluate the quality or correctness of outputs from another model
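The "relative performance within the group" idea in the GRPO definition above can be made concrete: each sampled completion's reward is normalized against the mean and standard deviation of its own group, which replaces the value network PPO would need. A minimal sketch (variable names are illustrative):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each reward by its group's
    mean and std, so no separate value network is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 4 completions sampled for one prompt, scored by a rule-based
# reward (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)  # correct samples get positive advantage
```

Completions that beat their group's average are reinforced and the rest are penalized, with advantages summing to zero within each group.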