Evaluation Setup
A comprehensive suite covering chat, instruction following, knowledge, math, code, and agentic tasks in both English and Thai.
Benchmarks:
- MT-Bench (multi-turn conversational quality in English and Thai)
- IFEval (verifiable instruction adherence)
- Thai Code-Switching (CS) (robustness to language mixing)
- OpenThaiEval (Thai exam-style questions and regional knowledge)
- HotpotQA (agentic retrieval, evaluated with tools)
Metrics:
- Accuracy
- MT-Bench Score (1-10)
- Pass@1
- Statistical methodology: Not explicitly reported in the paper
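The paper does not spell out how Pass@1 is computed. A common convention in code benchmarks is the unbiased pass@k estimator (Chen et al., 2021); the sketch below is that standard formula, not the paper's confirmed implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled generations of which c
    pass the tests, the probability that at least one of k randomly drawn
    samples passes. For k=1 this reduces to c/n."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` gives 0.3.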
Key Results
Ablation on Qwen3 8B: the full Typhoon-S recipe (SFT+OPD) significantly outperforms SFT alone.

| Benchmark | Metric | Baseline (SFT only) | This Paper (SFT+OPD) | Δ |
|---|---|---|---|---|
| Average Score (All Benchmarks) | Composite Score | 37.45 | 43.94 | +6.49 |
| Thai Code-Switching | Accuracy | 65.4 | 93.4 | +28.0 |
| HotpotQA (Thai) | Accuracy | 0.0 | 30.0 | +30.0 |

Comparison of the final Typhoon-S model (based on ThaiLLM) against a strong general-purpose multilingual model (Qwen3).

| Benchmark | Metric | Baseline (Qwen3) | This Paper (Typhoon-S) | Δ |
|---|---|---|---|---|
| Thai Benchmark Average | Composite Score | 66.66 | 71.20 | +4.54 |
| OpenThaiEval | Accuracy | 62.47 | 70.21 | +7.74 |
Main Takeaways
- SFT alone is insufficient for robust instruction following in sovereign settings, leading to brittleness in code-switching and agentic tasks.
- On-Policy Distillation (OPD) with full logits provides critical robustness for long-tail tokens and mixed-language generation compared to Top-K distillation.
- Including target-language (Thai) data during SFT is essential; removing it causes massive regression in local capabilities, whereas OPD is more robust to data mix.
- The 'Typhoon-S' recipe successfully transforms a region-specific base model (ThaiLLM) into a competitive instruction-following model without expensive proprietary pipelines.
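The full-logits vs. Top-K contrast in the second takeaway can be illustrated with a per-token reverse-KL loss, a common formulation for on-policy distillation (the paper's exact loss is not reproduced here; function names and the `-30.0` mask value in this NumPy sketch are illustrative assumptions). Truncating the teacher to its Top-K logits removes supervision on long-tail tokens, which is exactly where code-switched text lives:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Per-token reverse KL, KL(student || teacher): the student is
    penalized for placing probability where the teacher places little."""
    p_s = softmax(student_logits)
    return (p_s * (np.log(p_s) - np.log(softmax(teacher_logits)))).sum(axis=-1)

def topk_teacher(teacher_logits: np.ndarray, k: int) -> np.ndarray:
    """Simulate Top-K distillation: keep the k largest teacher logits and
    mask the rest with a large negative value (finite, to avoid log(0)).
    Tail tokens then receive near-zero teacher probability."""
    masked = np.full_like(teacher_logits, -30.0)
    idx = np.argsort(teacher_logits)[-k:]
    masked[idx] = teacher_logits[idx]
    return masked

# Toy vocabulary of 8 tokens with a long tail in the teacher distribution.
teacher = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0])
student = teacher.copy()  # student already matches the teacher, tail included

loss_full = reverse_kl(student, teacher)              # ~0: nothing to fix
loss_topk = reverse_kl(student, topk_teacher(teacher, 3))  # large: tail mass punished
```

With full logits the matched student incurs near-zero loss, while the Top-K teacher spuriously penalizes the student's (correct) long-tail mass, which is consistent with the paper's finding that full-logit OPD is more robust for mixed-language generation.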