
Krutrim LLM: Multilingual Foundational Model for over a Billion People

Aditya Kallappa, Palash Kamble, Abhinav Ravi, Akshat Patidar, Vinayak Dhruv, Deepak Kumar, Raghav Awasthi, A. Manjunath, Himanshu Gupta, Shubham Agarwal, K. Ashish, Gautam Bhargava, Chandra Khatri
arXiv.org (2025)
Tags: Pretraining, RL, RAG, Benchmark, Factuality

📝 Paper Summary

Keywords: Multilingual Large Language Models, Low-resource language adaptation
Krutrim LLM is a 7-billion-parameter model pre-trained on 2 trillion tokens with a custom Indic tokenizer and fine-tuned to address linguistic scarcity and cultural nuances in Indian languages.
Core Problem
Existing foundation models (e.g., LLaMA, GPT-3.5) perform poorly on Indic languages because of data scarcity (Indic content is roughly 1% of Common Crawl) and inefficient tokenization, leading to cultural bias and high inference costs.
Why it matters:
  • India represents 18% of the global population but is underrepresented in digital corpora (only 1% of Common Crawl)
  • Standard tokenizers fragment Indic scripts into far more tokens than equivalent English text, which raises computational cost and degrades context handling
  • Western-centric models often fail to capture India's specific cultural nuances, oral traditions, and code-mixing behaviors
Concrete Example: The paper notes that Sanskrit allows for virtually infinite compound words, which Western-designed tokenizers struggle to process efficiently, resulting in excessively long sequences that hamper model effectiveness compared to English.
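To make the tokenization cost concrete, the sketch below compares token counts for parallel English and Hindi sentences using an off-the-shelf English-centric tokenizer. GPT-2's tokenizer is used only as an illustrative stand-in, and the sample sentences and the "fertility" metric (tokens per word) are our own, not taken from the paper.

```python
# Minimal sketch: measure tokenizer "fertility" (tokens per word) on parallel
# English and Hindi text. GPT-2's tokenizer stands in for an English-centric
# tokenizer; the sentences are illustrative, not from the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "India is a country with many languages and a rich oral tradition.",
    "Hindi":   "भारत कई भाषाओं और समृद्ध मौखिक परंपरा वाला देश है।",
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    print(f"{lang}: {len(words)} words -> {len(tokens)} tokens "
          f"(fertility = {len(tokens) / len(words):.2f})")
```

An English-centric BPE vocabulary typically falls back to byte-level pieces for Devanagari, so the Hindi sentence yields several times more tokens per word than the English one.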
Key Novelty
Krutrim LLM (India-centric Foundation Model)
  • Trained on the largest known dataset of Indic tokens (hundreds of billions) within a 2-trillion-token corpus to mitigate data scarcity
  • Uses a custom SentencePiece tokenizer trained from scratch to handle the complex morphology and compound words of Indic languages (a minimal training sketch follows this list)
  • Incorporates India-centric Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align with local cultural values and safety norms (a DPO loss sketch also follows)
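Below is a minimal sketch of how such a tokenizer could be trained with the SentencePiece library. The corpus path, vocabulary size, and training options are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: train a BPE SentencePiece model on a mixed Indic/English
# corpus. File names, vocab size, and options are assumptions for illustration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="indic_mixed_corpus.txt",   # hypothetical one-sentence-per-line corpus
    model_prefix="indic_tokenizer",
    vocab_size=64000,                 # illustrative; the paper's vocab size may differ
    model_type="bpe",
    character_coverage=1.0,           # keep full coverage of Indic scripts
    byte_fallback=True,               # avoid <unk> for rare characters
)

sp = spm.SentencePieceProcessor(model_file="indic_tokenizer.model")
print(sp.encode("भारत कई भाषाओं वाला देश है।", out_type=str))
```

Training on Indic-heavy text keeps frequent aksharas and common compounds as single pieces, which is what drives the fertility reduction the paper targets.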
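For the alignment stage, the summary mentions DPO. The following is a generic sketch of the standard DPO objective in PyTorch, assuming the summed log-probabilities of each response are precomputed under the trainable policy and a frozen reference model; this is not Krutrim's actual training code.

```python
# Generic DPO loss sketch (Rafailov et al.), not Krutrim-specific code.
# Inputs are summed log-probabilities of the chosen/rejected responses under
# the trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward: how much more the policy prefers each response than
    # the reference does; the loss pushes the chosen-vs-rejected margin up.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    return -F.logsigmoid(margin).mean()

# Toy usage with fabricated log-probabilities, for illustration only.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```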
Evaluation Highlights
  • Outperforms LLaMA-2 on 10 of 16 English benchmarks, with an average score of 0.57 vs. 0.55
  • Surpasses or matches state-of-the-art models on Indic-language benchmarks despite a smaller training-FLOP budget
  • Training consumed on the order of 10^23 FLOPs on H100 GPUs (see the rough check after this list)
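As a rough consistency check (our own arithmetic, not from the paper), the common 6·N·D approximation for dense-transformer training FLOPs lands near the reported order of magnitude:

```python
# Back-of-the-envelope check using the common 6*N*D approximation for
# training FLOPs of a dense transformer (our arithmetic, not the paper's).
n_params = 7e9     # 7B parameters
n_tokens = 2e12    # 2T training tokens
flops = 6 * n_params * n_tokens
print(f"~{flops:.1e} FLOPs")   # ~8.4e+22, i.e. on the order of 10^23
```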
Breakthrough Assessment
7/10
Significant for creating a dedicated tokenizer and large-scale dataset for underrepresented Indic languages, though the architecture relies on established techniques (LLaMA-2-style decoder, GQA, ALiBi).