MoE: Mixture of Experts—a neural network architecture where different parts of the model (experts) specialize in different tasks or data patterns
Dense MoE: A variation of MoE where all experts are activated and computed for every input, rather than selecting a sparse subset
Sparse MoE: The traditional MoE approach where only a few experts (e.g., top-2) are active per token to save compute
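The dense/sparse distinction above comes down to the router: whether every expert's output is mixed in, or only the top-k. A minimal NumPy sketch (the experts here are toy one-layer networks, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(x, w):
    # Stand-in for an expert network: a single ReLU layer.
    return np.maximum(x @ w, 0.0)

d, n_experts, top_k = 8, 4, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe(x, sparse=True):
    # Router produces a softmax weight per expert for this token.
    logits = x @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if sparse:
        # Sparse MoE: zero out all but the top-k experts, renormalize.
        keep = np.argsort(probs)[-top_k:]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    # Dense MoE (sparse=False) evaluates and mixes every expert.
    return sum(p * expert(x, w) for p, w in zip(probs, experts) if p > 0)

x = rng.normal(size=d)
y_sparse = moe(x, sparse=True)   # compute touches only top_k experts
y_dense = moe(x, sparse=False)   # compute touches all n_experts
```

Real systems route per token inside each Transformer layer and add load-balancing losses; this sketch only shows the gating arithmetic.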
SwiGLU: Swish-Gated Linear Unit—a gated feed-forward activation that multiplies a Swish-activated branch with a linear branch, used in modern LLMs such as LLaMA
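The gating described above is easy to write out. A sketch of the SwiGLU feed-forward block (weight shapes are illustrative; LLaMA-style models use three projections like this):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gate branch (Swish-activated) multiplied elementwise with a
    # linear "up" branch, then projected back to model dimension.
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.normal(size=d)
y = swiglu_ffn(x,
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d_ff, d)))
```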
RoPE: Rotary Positional Embeddings—a method for encoding position in Transformers by rotating query/key vectors, which generalizes better to longer sequences because attention depends only on relative offsets
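RoPE's key property is that the dot product between a rotated query and key depends only on their relative distance. A minimal sketch using the common split-half pairing of dimensions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate pairs of dimensions by a position-dependent angle.
    # Each pair i gets its own frequency base**(-i/half).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos])

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
# Same offset (2) between positions => same attention score.
a = rope(q, 3) @ rope(k, 5)
b = rope(q, 10) @ rope(k, 12)
```

Because only the offset matters, positions beyond those seen in training still produce meaningful scores, which is why RoPE extrapolates better than learned absolute embeddings.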
Self-QA: A data generation method where the model generates its own question-answer pairs from unsupervised text to create instruction-tuning data
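The Self-QA loop can be sketched in a few lines. Here `ask_model` is a stub standing in for a real LLM call, and the prompt wording is an assumption, not the method's exact prompt:

```python
def ask_model(prompt: str) -> str:
    # Stub for an actual LLM call, so the sketch runs standalone.
    return "Q: What is RoPE?\nA: A rotary positional embedding scheme."

def self_qa(passages):
    # For each unlabeled passage, ask the model to write a QA pair,
    # then keep the pairs as instruction-tuning examples.
    pairs = []
    for text in passages:
        reply = ask_model("Read the passage and write one question "
                          f"and its answer.\n\nPassage: {text}")
        q, _, a = reply.partition("\nA: ")
        pairs.append({"question": q.removeprefix("Q: "), "answer": a})
    return pairs

data = self_qa(["RoPE encodes positions by rotating query/key vectors."])
```

Real pipelines add filtering (deduplication, answer-quality checks) before the pairs are used for tuning.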
Hybrid-tuning: A fine-tuning strategy that mixes pre-training data (completion) with instruction data (chat) to prevent catastrophic forgetting
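The mixing itself is simple to sketch. A hypothetical sampler that interleaves the two data types at a chosen ratio (field names and the chat template here are assumptions for illustration):

```python
import random

def hybrid_stream(pretrain_docs, instruction_pairs,
                  chat_ratio=0.5, n=8, seed=0):
    # Sample a training stream that mixes raw-text completion examples
    # with chat-formatted instruction examples, so the model keeps its
    # base language-modeling ability while learning to follow instructions.
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        if rng.random() < chat_ratio:
            ex = rng.choice(instruction_pairs)
            text = f"User: {ex['prompt']}\nAssistant: {ex['response']}"
            stream.append({"type": "chat", "text": text})
        else:
            stream.append({"type": "completion",
                           "text": rng.choice(pretrain_docs)})
    return stream

batch = hybrid_stream(["Raw pre-training text."],
                      [{"prompt": "Hi", "response": "Hello!"}])
```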
RMSNorm: Root Mean Square Layer Normalization—a normalization technique used to stabilize training in deep networks
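RMSNorm drops LayerNorm's mean subtraction and bias, rescaling only by the root mean square of the activations. A minimal sketch:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    # Divide by the root-mean-square of the last axis; unlike
    # LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rmsnorm(x, gain=np.ones(4))  # output has unit RMS
```

Skipping the mean statistic makes RMSNorm slightly cheaper than LayerNorm while stabilizing training comparably, which is why LLaMA-family models use it.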