DPO: Direct Preference Optimization—a method to align language models with human preferences without a separate reward model, used here for safety and cultural alignment
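The preference signal behind DPO can be sketched as a loss over log-probability ratios. This is a minimal illustration, not the report's implementation; the sequence log-probs are toy numbers, and `beta=0.1` is an assumed hyperparameter:

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return float(np.log1p(np.exp(-margin)))  # = -log(sigmoid(margin))

# Toy sequence log-probs: the policy already prefers the chosen response.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-8.0)
print(round(loss, 4))  # 0.5981
```

Driving the loss down pushes the policy to raise the chosen response's likelihood relative to the rejected one, with no separately trained reward model in the loop.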
GQA: Grouped Query Attention—an attention mechanism in which a group of query heads shares a single key/value head, shrinking the KV cache and reducing memory bandwidth usage during inference
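The head-sharing idea can be shown in a small numpy sketch. This is an illustrative single-layer forward pass with made-up dimensions (8 query heads, 2 key/value heads), not the model's actual attention code:

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Minimal GQA: each group of query heads attends with one shared K/V head."""
    T, _ = x.shape
    hd = Wq.shape[1] // n_q_heads          # per-head dimension
    group = n_q_heads // n_kv_heads        # query heads per K/V head
    q = (x @ Wq).reshape(T, n_q_heads, hd)
    k = (x @ Wk).reshape(T, n_kv_heads, hd)
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    # Broadcast each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h] = w @ v[:, h]
    return out.reshape(T, -1)

rng = np.random.default_rng(0)
T, d, n_q, n_kv, hd = 4, 16, 8, 2, 4
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, n_q * hd))
Wk = rng.normal(size=(d, n_kv * hd))  # K/V projections are n_kv/n_q the size of Wq
Wv = rng.normal(size=(d, n_kv * hd))
y = grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv)
print(y.shape)  # (4, 32)
```

The KV cache stores only `n_kv_heads` heads per token instead of `n_q_heads`, which is where the memory-bandwidth saving comes from.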
ALiBi: Attention with Linear Biases—a positional encoding method that penalizes attention scores in proportion to query–key distance, allowing models to extrapolate to sequence lengths longer than those seen during training
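The linear bias is just a per-head penalty matrix added to attention scores before the softmax. A minimal sketch, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count:

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi: per-head bias -m * (i - j) added to causal attention scores."""
    # Geometric head slopes: 2^-1, 2^-2, ..., for n_heads a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).clip(min=0)            # distance to past tokens only
    return -slopes[:, None, None] * dist  # shape (n_heads, seq_len, seq_len)

bias = alibi_bias(n_heads=8, seq_len=5)
print(bias.shape)     # (8, 5, 5)
print(bias[0, 4, 0])  # -2.0: head 0 penalizes the most distant token hardest
```

Because the penalty is a simple linear function of distance rather than a learned embedding, it applies unchanged to positions beyond the training length, which is what enables extrapolation.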
SFT: Supervised Fine-Tuning—training the pre-trained model on labeled instruction-response pairs to teach it how to follow instructions
CPT: Continual Pre-training—further training a base model on domain-specific or new language data to add capabilities without starting from scratch
Token-to-word ratio: A measure of tokenizer efficiency; a high ratio means a single word is broken into many tokens, increasing compute cost and shrinking the effective context window
Common Crawl: A massive open repository of web crawl data, often used as the primary source for training large language models