Mi:dm 2.0: Korea-centric Bilingual Language Models

📝 Paper Summary

LLM Pre-training, Korean Language Models, Data Curation

Mi:dm 2.0 is a bilingual LLM engineered through rigorous data filtering and synthetic augmentation to internalize Korean societal values and reasoning patterns rather than just linguistic translation.

Core Problem

Existing Korean LLMs are often trained on low-quality or insufficient data, leading to hallucinations, unnatural phrasing, and a lack of alignment with Korean cultural norms.

Why it matters:

Models trained on generic web data often produce emotionally incongruent or culturally insensitive responses in high-stakes local applications
The scarcity of high-quality Korean corpora compared to English creates a structural performance gap for non-English languages
Prior models frequently revert to English or hallucinate when faced with specific Korean cultural contexts due to data misalignment

Concrete Example: Given a query about 'King Sejong', existing models might answer with an overly literal translation or with factual errors because of poor training data, whereas Mi:dm 2.0 is trained on curated historical and cultural datasets intended to produce culturally accurate outputs.

Key Novelty

Korea-centric Data Pipeline & Depth-up Scaling

Implements a rigorous 'Korea-centric' data pipeline that prioritizes cultural alignment and reasoning over raw token count, heavily supplementing organic data with synthetic textbook-style rewrites
Utilizes Depth-up Scaling (DuS) to efficiently expand an 8B base model into an 11.5B model by leveraging learned representations without training from scratch

Architecture

The 8-stage data filtering pipeline for Korean web data

Evaluation Highlights

Achieves top-tier zero-shot results on KMMLU (Korean Massive Multitask Language Understanding), a Korean-specific benchmark
Demonstrates strong performance in internal evaluations across language, humanities, and social science tasks compared to comparable domestic models
Successfully deployed in two sizes (2.3B Mini and 11.5B Base) to cover both resource-constrained and general-purpose use cases

Breakthrough Assessment

6/10

Solid contribution to region-specific LLMs with a strong focus on data quality and cultural alignment, though the architectural innovation (Depth-up Scaling) is an application of existing techniques rather than a novel method.

⚙️ Technical Details

Problem Definition

Setting: Pre-training and instruction-tuning of Large Language Models specifically for Korean cultural and linguistic alignment

Inputs: Korean and English text sequences

Outputs: Next-token prediction / Generated text response

Pipeline Flow

Data Collection (Organic & Synthetic)
Data Filtering & Refinement Pipeline
Tokenizer Training
Pre-training (Base or Mini)
Instruction Tuning

System Modules

Data Cleansing Pipeline (Data Processing)

Filter raw web data to remove noise, PII, and low-quality text

Model or implementation: Ensemble of binary classifiers (Quality, Educational, Toxic)
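
A minimal sketch of how an ensemble of binary classifiers could gate documents, assuming each classifier returns a probability and a document is kept only when it passes all three checks. The scoring functions, keyword lists, and thresholds below are hypothetical stand-ins chosen to make the example runnable; they are not the paper's classifiers.

```python
from typing import List

# Hypothetical stand-in scorers: each returns a probability in [0, 1].
# Real classifiers would be trained models; these toy heuristics only
# exist so the gating logic can be executed.
def quality_score(text: str) -> float:
    # Longer text is treated as higher quality, capped at 1.0.
    return min(len(text.split()) / 20.0, 1.0)

def educational_score(text: str) -> float:
    keywords = ("because", "therefore", "history", "definition")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.2

def toxic_score(text: str) -> float:
    blocked = ("idiot", "stupid")
    return 1.0 if any(b in text.lower() for b in blocked) else 0.0

def keep_document(
    text: str,
    min_quality: float = 0.5,      # assumed threshold
    min_educational: float = 0.5,  # assumed threshold
    max_toxic: float = 0.5,        # assumed threshold
) -> bool:
    """Keep a document only if every classifier in the ensemble agrees."""
    return (
        quality_score(text) >= min_quality
        and educational_score(text) >= min_educational
        and toxic_score(text) <= max_toxic
    )

def filter_corpus(docs: List[str]) -> List[str]:
    return [d for d in docs if keep_document(d)]

docs = [
    "King Sejong created Hangul in the fifteenth century, a landmark in Korean history.",
    "click here",
    "You are an idiot because you ignore Korean history and its long written tradition.",
]
kept = filter_corpus(docs)
```

Requiring all classifiers to agree is the strictest way to combine them; a weighted vote would keep more data at the cost of more noise.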

Synthetic Data Generator (Data Processing)

Augment scarce Korean data with translated/rewritten content and reasoning chains

Model or implementation: Language Models (specifics not detailed)

Tokenizer

Convert text into tokens optimized for Korean morphology

Model or implementation: Custom Korean-optimized tokenizer

Mi:dm 2.0 Base (Core Model)

General-purpose language generation and reasoning

Model or implementation: 11.5B parameter Transformer (Depth-up Scaled from 8B)

Mi:dm 2.0 Mini (Core Model)

Efficient generation for resource-constrained environments

Model or implementation: 2.3B parameter Transformer

Novel Architectural Elements

Application of Depth-up Scaling (DuS) to expand a pre-trained 8B Korean model into an 11.5B model, reusing learned representations
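
The layer-level idea behind depth up-scaling can be sketched as follows, assuming one commonly described recipe: duplicate the pretrained layer stack, drop the last few layers from one copy and the first few from the other, and concatenate the two. The layer counts and the `drop` value here are illustrative only; the summary does not state the configuration used for Mi:dm 2.0, and continued pre-training after expansion is not shown.

```python
import copy
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def depth_up_scale(layers: Sequence[T], drop: int) -> List[T]:
    """Build a deeper stack from a pretrained one.

    The lower copy keeps layers [0, n - drop); the upper copy keeps
    layers [drop, n). The result has 2 * (n - drop) layers, all
    initialized from already-trained weights.
    """
    n = len(layers)
    if not 0 <= drop < n:
        raise ValueError("drop must satisfy 0 <= drop < number of layers")
    lower = [copy.deepcopy(layer) for layer in layers[: n - drop]]
    upper = [copy.deepcopy(layer) for layer in layers[drop:]]
    return lower + upper

# Illustrative numbers only (not taken from the paper): a 32-layer stack
# with 8 layers dropped from each copy yields 48 layers.
base_layers = [f"layer_{i}" for i in range(32)]
scaled_layers = depth_up_scale(base_layers, drop=8)
```

Strings stand in for Transformer blocks so the sketch runs without a deep learning framework; with real modules the same slicing applies to the model's list of blocks.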

Modeling

Base Model: Mi:dm 2.0 Base (11.5B) and Mi:dm 2.0 Mini (2.3B)

Training Method: Pre-training followed by Instruction Tuning

Trainable Parameters: 11.5B (Base), 2.3B (Mini)

Training Data:

85.7% Organic Data (Web, Books, News, Papers)
About 14% Synthetic Data (Rewrites, Translations, CoT)
Strict 8-stage filtering pipeline applied to Common Crawl data

Compute: Not reported in the paper

Comparison to Prior Work

vs. English-centric LLMs: Mi:dm 2.0 is explicitly trained on culturally aligned Korean data to avoid 'translationese' and cultural misalignment
vs. Standard pre-training: Uses Depth-up Scaling (DuS) to efficiently scale from 8B to 11.5B rather than training the larger model from scratch
vs. Web-crawled baselines: Heavily relies on synthetic data (14%) and textbook-style rewriting to compensate for the low quality of the Korean web (Common Crawl)

Limitations

Heavy reliance on synthetic data (14%) raises potential for propagated biases or artifacts if generation models are flawed
Efficiency gains of Depth-up Scaling over training from scratch are described only qualitatively; no quantitative comparison is reported
Evaluation is primarily focused on Korean-specific benchmarks; multilingual performance details are limited

Reproducibility

Code: https://huggingface.co/K-intelligence

Models (Base & Mini) are available on Hugging Face (https://huggingface.co/K-intelligence). Training code and the specific dataset mixtures are not released; the paper mentions a future release of the training code.

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot evaluation on Korean-specific benchmarks and internal human evaluation

Benchmarks:

KMMLU (Korean Massive Multitask Language Understanding)
Internal Evaluations (Language, Humanities, Social Science tasks) [New]

Metrics:

Accuracy
Zero-shot performance scores
Statistical methodology: Not explicitly reported in the paper
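
For reference, the accuracy metric listed above can be sketched as below, assuming multiple-choice items with a single gold answer letter per question. The normalization step and the sample answers are illustrative assumptions, not details from the paper.

```python
from typing import Sequence

def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of items where the predicted choice matches the gold choice."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Normalize whitespace and case so "b " and "B" count as the same answer.
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Hypothetical model outputs and gold labels for four questions.
score = accuracy(["A", "b ", "C", "D"], ["A", "B", "D", "D"])
```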

Key Results

The paper claims state-of-the-art performance on Korean benchmarks, but exact comparative numbers against specific external baselines are not available here: the text references tables whose values are not included in the provided excerpt, so a Benchmark / Metric / Baseline / This Paper / Δ table cannot be filled in.

Experiment Figures

Token distribution of the pre-training corpus across data sources and domains

Main Takeaways

Mi:dm 2.0 achieves top-tier zero-shot results on KMMLU, indicating strong general knowledge in Korean contexts
The model demonstrates robust performance in humanities and social sciences, domains heavily represented in the training data
Synthetic data augmentation proved critical for improving performance in STEM and applied sciences, which were originally underrepresented in the organic corpus
The 2.3B Mini model provides a viable alternative for resource-constrained environments, particularly for intent understanding tasks

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture fundamentals
LLM pre-training and instruction tuning
Synthetic data generation strategies

Key Terms

Depth-up Scaling: A method to scale up a model's depth (number of layers) efficiently by initializing new layers based on existing ones, avoiding training from scratch

Korea-centric AI: AI designed to internalize unique Korean values, cognitive frameworks, and commonsense reasoning, beyond simple translation

KMMLU: Korean Massive Multitask Language Understanding—a benchmark for evaluating LLMs on various subjects in Korean

Organic data: Naturally occurring, human-authored text (e.g., web pages, books, news)

Synthetic data: Text generated or augmented by AI models (e.g., translations, rewrites, Chain-of-Thought reasoning)

Chain-of-Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

DuS: Depth-up Scaling—the specific scaling strategy used to grow the 8B model to 11.5B

Perplexity: A measurement of how well a probability model predicts a sample; lower perplexity indicates the model is less surprised by the text
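
As a worked illustration of the perplexity definition, the standard formulation is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from a model; the input values are made up for the example.

```python
import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum(log p(token_i | context)))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options, so its perplexity is 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```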