Evaluation Setup
Fine-tuning post-trained models on sentiment analysis and topic classification tasks, with training sets scaled from 10% to 100% of the available data.
Benchmarks:
- Translated Financial Phrasebank (sentiment analysis, 3 classes) [New]
- IndoFinSent (sentiment analysis, native Indonesian) [New]
- Translated Twitter Financial News (topic classification, 20 topics) [New]
Metrics:
- F1 score
- Statistical methodology: Not explicitly reported in the paper
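The setup above (train on growing fractions of the data, score with F1) can be sketched as a small evaluation loop. This is an illustrative stand-in, not the paper's pipeline: the nearest-centroid classifier below is a hypothetical placeholder for the fine-tuned model, and the macro-averaged F1 is one common choice when the paper does not specify the averaging.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def evaluate_at_fractions(X_tr, y_tr, X_te, y_te, fractions,
                          n_classes=3, seed=0):
    """Train a stand-in classifier on each fraction of the training
    set and report macro F1 on the held-out test set."""
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        # Subsample the training set to the requested fraction.
        n = max(n_classes, int(frac * len(X_tr)))
        idx = rng.choice(len(X_tr), size=n, replace=False)
        Xs, ys = X_tr[idx], y_tr[idx]
        # Nearest-centroid classifier as a cheap model placeholder.
        centroids = np.stack([
            Xs[ys == c].mean(axis=0) if np.any(ys == c)
            else np.zeros(X_tr.shape[1])
            for c in range(n_classes)
        ])
        dists = ((X_te[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        preds = np.argmin(dists, axis=1)
        scores[frac] = macro_f1(y_te, preds, n_classes)
    return scores
```

Sweeping `fractions = [0.1, 0.3, ..., 1.0]` and comparing the resulting F1 curves for the baseline versus the post-trained model reproduces the shape of the comparison reported here.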
Key Results
Sentiment analysis (Base model) on the Translated Financial Phrasebank. The post-trained models generally outperform the generic baseline, especially with limited data:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Translated Financial Phrasebank | F1 score | 0.91 | 0.94 | +0.03 |
| Translated Financial Phrasebank | F1 score | 0.55 | 0.81 | +0.26 |

Topic classification (Base model) on Translated Twitter Financial News. Gains are modest but consistent with Financial News post-training:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Translated Twitter Financial News | F1 score | 0.85 | 0.85 | 0.00 |
| Translated Twitter Financial News | F1 score | 0.64 | 0.66 | +0.02 |

Native Indonesian dataset evaluation, validating on real-world native data:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| IndoFinSent | F1 score | Not reported in the paper | 0.81 | Not reported in the paper |
Main Takeaways
- Domain-specific post-training significantly improves performance when fine-tuning data is scarce (e.g., an absolute gain of +0.26 F1 when training on only 30% of the data).
- Base models benefit much more from post-training than Large models, likely because Large models already capture sufficient features or require larger domain corpora to adapt further.
- Post-training on data similar to the downstream task (Financial News vs. Corporate Reports) yields better results; News-based post-training helped News-based classification most.