Evaluation Setup
A comprehensive suite covering chat, instruction following, knowledge, math, code, and agentic tasks in both English and Thai.
Benchmarks:
- MT-Bench (multi-turn conversational quality in English and Thai)
- IFEval (verifiable instruction adherence)
- Thai Code-Switching (CS) (robustness to language mixing)
- OpenThaiEval (Thai exam-style questions and regional knowledge)
- HotpotQA (agentic retrieval, evaluated with tools)
Metrics:
- Accuracy
- MT-Bench Score (1-10)
- Pass@1
- Statistical methodology: Not explicitly reported in the paper
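The paper does not spell out how Pass@1 is computed. A common convention in code benchmarks is the unbiased pass@k estimator (Chen et al., 2021); the sketch below is that standard formula, not the paper's confirmed implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled generations of which c
    pass the tests, the probability that at least one of k randomly drawn
    samples passes. For k=1 this reduces to c/n."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` gives 0.3.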
Key Results
Ablation on Qwen3 8B: the full Typhoon-S recipe (SFT+OPD) significantly outperforms SFT alone.

| Benchmark | Metric | Baseline (SFT only) | This Paper (SFT+OPD) | Δ |
|---|---|---|---|---|
| Average Score (All Benchmarks) | Composite Score | 37.45 | 43.94 | +6.49 |
| Thai Code-Switching | Accuracy | 65.4 | 93.4 | +28.0 |
| HotpotQA (Thai) | Accuracy | 0.0 | 30.0 | +30.0 |

Comparison of the final Typhoon-S model (based on ThaiLLM) against a strong general-purpose multilingual model (Qwen3).

| Benchmark | Metric | Baseline (Qwen3) | This Paper (Typhoon-S) | Δ |
|---|---|---|---|---|
| Thai Benchmark Average | Composite Score | 66.66 | 71.20 | +4.54 |
| OpenThaiEval | Accuracy | 62.47 | 70.21 | +7.74 |
Main Takeaways
- SFT alone is insufficient for robust instruction following in sovereign settings, leading to brittleness in code-switching and agentic tasks.
- On-Policy Distillation (OPD) with full logits provides critical robustness for long-tail tokens and mixed-language generation compared to Top-K distillation.
- Including target-language (Thai) data during SFT is essential; removing it causes massive regression in local capabilities, whereas OPD is more robust to data mix.
- The 'Typhoon-S' recipe successfully transforms a region-specific base model (ThaiLLM) into a competitive instruction-following model without expensive proprietary pipelines.
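The full-logits vs. Top-K contrast in the second takeaway can be illustrated with a per-token reverse-KL loss, a common formulation for on-policy distillation (the paper's exact loss is not reproduced here; function names and the `-30.0` mask value in this NumPy sketch are illustrative assumptions). Truncating the teacher to its Top-K logits removes supervision on long-tail tokens, which is exactly where code-switched text lives:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Per-token reverse KL, KL(student || teacher): the student is
    penalized for placing probability where the teacher places little."""
    p_s = softmax(student_logits)
    return (p_s * (np.log(p_s) - np.log(softmax(teacher_logits)))).sum(axis=-1)

def topk_teacher(teacher_logits: np.ndarray, k: int) -> np.ndarray:
    """Simulate Top-K distillation: keep the k largest teacher logits and
    mask the rest with a large negative value (finite, to avoid log(0)).
    Tail tokens then receive near-zero teacher probability."""
    masked = np.full_like(teacher_logits, -30.0)
    idx = np.argsort(teacher_logits)[-k:]
    masked[idx] = teacher_logits[idx]
    return masked

# Toy vocabulary of 8 tokens with a long tail in the teacher distribution.
teacher = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0])
student = teacher.copy()  # student already matches the teacher, tail included

loss_full = reverse_kl(student, teacher)              # ~0: nothing to fix
loss_topk = reverse_kl(student, topk_teacher(teacher, 3))  # large: tail mass punished
```

With full logits the matched student incurs near-zero loss, while the Top-K teacher spuriously penalizes the student's (correct) long-tail mass, which is consistent with the paper's finding that full-logit OPD is more robust for mixed-language generation.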