Evaluation Setup
Fine-tuning post-trained models on sentiment analysis and topic classification tasks, with training sets scaled from 10% to 100% of the available data.
Benchmarks:
- Translated Financial Phrasebank (sentiment analysis, 3 classes) [New]
- IndoFinSent (sentiment analysis, native Indonesian) [New]
- Translated Twitter Financial News (topic classification, 20 topics) [New]
Metrics:
- F1 score
- Statistical methodology: Not explicitly reported in the paper
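The setup above (train on growing fractions of the data, score with F1) can be sketched as a small evaluation loop. This is an illustrative stand-in, not the paper's pipeline: the nearest-centroid classifier below is a hypothetical placeholder for the fine-tuned model, and the macro-averaged F1 is one common choice when the paper does not specify the averaging.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def evaluate_at_fractions(X_tr, y_tr, X_te, y_te, fractions,
                          n_classes=3, seed=0):
    """Train a stand-in classifier on each fraction of the training
    set and report macro F1 on the held-out test set."""
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        # Subsample the training set to the requested fraction.
        n = max(n_classes, int(frac * len(X_tr)))
        idx = rng.choice(len(X_tr), size=n, replace=False)
        Xs, ys = X_tr[idx], y_tr[idx]
        # Nearest-centroid classifier as a cheap model placeholder.
        centroids = np.stack([
            Xs[ys == c].mean(axis=0) if np.any(ys == c)
            else np.zeros(X_tr.shape[1])
            for c in range(n_classes)
        ])
        dists = ((X_te[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        preds = np.argmin(dists, axis=1)
        scores[frac] = macro_f1(y_te, preds, n_classes)
    return scores
```

Sweeping `fractions = [0.1, 0.3, ..., 1.0]` and comparing the resulting F1 curves for the baseline versus the post-trained model reproduces the shape of the comparison reported here.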
Key Results
Sentiment analysis (Base model) on the Translated Financial Phrasebank. The post-trained models generally outperform the generic baseline, especially with limited data:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Translated Financial Phrasebank | F1 score | 0.91 | 0.94 | +0.03 |
| Translated Financial Phrasebank | F1 score | 0.55 | 0.81 | +0.26 |

Topic classification (Base model) on Translated Twitter Financial News. Gains are modest but consistent with Financial News post-training:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Translated Twitter Financial News | F1 score | 0.85 | 0.85 | 0.00 |
| Translated Twitter Financial News | F1 score | 0.64 | 0.66 | +0.02 |

Native Indonesian dataset evaluation, validating on real-world native data:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| IndoFinSent | F1 score | Not reported in the paper | 0.81 | Not reported in the paper |
Main Takeaways
- Domain-specific post-training significantly improves performance when fine-tuning data is scarce (e.g., an absolute gain of +0.26 F1 when training on only 30% of the data).
- Base models benefit much more from post-training than Large models, likely because Large models already capture sufficient features or require larger domain corpora to adapt further.
- Post-training on data similar to the downstream task (Financial News vs. Corporate Reports) yields better results; News-based post-training helped News-based classification most.