Mohammed Safi Ur Rahman Khan, Priyam Mehta, A. Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, G. Suriyaprasaad, G. VarunBalan, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
Indian Institute of Technology, Madras,
Microsoft,
National Institute of Information and Communications Technology
Annual Meeting of the Association for Computational Linguistics
(2024)
Pretraining · Benchmark · Speech · MM
📝 Paper Summary
Low-resource NLP · Multilingual Data Curation
IndicLLMSuite addresses the scarcity of Indian language resources by releasing 251 billion tokens of pre-training data and 74 million instruction pairs created via a novel pipeline combining human verification and synthetic generation.
Core Problem
Low- and mid-resource languages lack tailored data resources for LLM development: standard open-source pipelines fail to effectively curate diverse sources (websites, PDFs, videos), and no sufficiently capable existing models are available to generate high-quality synthetic instructions.
Why it matters:
Languages spoken by over 1.4 billion people (Indian sub-continent) are minimally represented in current open-source LLMs (e.g., Llama 2, Mistral), excluding rich cultural contexts.
The 'chicken and egg' problem persists: the lack of high-quality Indic LLMs prevents generating the synthetic data (as in Self-Instruct) needed to train better ones.
Existing web-crawled corpora like Common Crawl contain noise, mislabeled languages, and defunct URLs, so lower-resource languages require dedicated curation.
Concrete Example: Standard language identification tools often mislabel closely related Indic languages (e.g., Hindi and Marathi). Furthermore, many URLs in existing datasets like mC4 are now defunct or contain machine-translated content that degrades model quality, necessitating the manual verification strategy employed in this work.
Combines manually verified sources (Sangraha Verified) with large-scale synthetic translation of English content to balance high quality with the volume required for LLM training.
Utilizes a custom Spark-based distributed pipeline (Setu) specifically designed to handle diverse Indic sources including OCR for PDFs and transcription for videos, unlike text-only web crawlers.
Generates culture-grounded instruction data by prompting English LLMs (Llama 2, Mixtral) with India-centric Wikipedia articles, creating synthetic conversations rooted in local context.
Architecture
Overview of the IndicLLMSuite contributions, categorized into Pre-training (Sangraha), Pipeline (Setu), and Fine-tuning (IndicAlign) components.
Evaluation Highlights
Release of Sangraha: A massive pre-training corpus containing 251 billion tokens across 22 Indian languages.
Release of IndicAlign-Instruct: A collection of 74.8 million instruction-response pairs created via translation, aggregation, and synthetic generation.
Release of IndicAlign-Toxic: 123,000 toxic prompt and non-toxic response pairs for safety alignment.
Breakthrough Assessment
8/10
A foundational resource contribution that significantly lowers the barrier for building LLMs in Indian languages. While it doesn't propose a new model architecture, the scale (251B tokens) and rigorous curation pipeline are major advancements for the field.
⚙️ Technical Details
Problem Definition
Setting: Creation of large-scale pre-training and fine-tuning datasets for 22 scheduled Indian languages
Inputs: Diverse raw data sources: URLs, PDFs, Videos, existing English/Multilingual corpora
Outputs: Cleaned, deduplicated, and tokenized text datasets (Sangraha) and instruction pairs (IndicAlign)
Pipeline Flow
Source Discovery: URL curation via search/news repositories -> Human Verification
Extraction: Setu Pipeline (Crawling / PDF OCR / Video Transcription)
Processing: Cleaning -> Language Identification -> Toxicity Filtering -> Deduplication
Synthetic Generation: Translation of English datasets -> Wiki-grounded conversation generation
Compilation: Aggregation into Sangraha (Pre-training) and IndicAlign (Instruction)
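The processing stages above can be sketched in miniature. This is an illustrative toy, not the actual Setu implementation: the real pipeline runs on Apache Spark with dedicated LID and toxicity models, and all function names, the Devanagari-range LID heuristic, and the blocklist below are assumptions for demonstration.

```python
import hashlib

def clean(doc: str) -> str:
    # Normalize whitespace; real cleaning also strips HTML remnants, etc.
    return " ".join(doc.split())

def identify_language(doc: str) -> str:
    # Placeholder LID: a real system needs a classifier robust to
    # closely related Indic languages (e.g. Hindi vs. Marathi).
    return "hi" if any("\u0900" <= ch <= "\u097F" for ch in doc) else "en"

def is_toxic(doc: str, blocklist=("badword",)) -> bool:
    # Placeholder toxicity filter via substring blocklist.
    return any(w in doc.lower() for w in blocklist)

def process(docs, keep_langs={"hi"}):
    seen, out = set(), []
    for doc in docs:
        doc = clean(doc)
        if identify_language(doc) not in keep_langs:
            continue
        if is_toxic(doc):
            continue
        h = hashlib.sha256(doc.encode()).hexdigest()  # exact dedup by hash
        if h in seen:
            continue
        seen.add(h)
        out.append(doc)
    return out
```

Exact-hash deduplication is used here for brevity; large-scale pipelines typically also apply fuzzy (e.g. MinHash-based) deduplication.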
System Modules
Setu Pipeline
Distributed extraction and cleaning of text from heterogeneous sources
Model or implementation: Apache Spark-based custom pipeline
Translation Engine (Synthetic Generation)
Translate high-quality English datasets into Indian languages
Model or implementation: Open-source MT models (exact model not specified in snippet, likely NLLB or IndicTrans2 based on authors' prior work)
Instruction Generator (Synthetic Generation)
Generate context-grounded conversations from encyclopedia articles
Model or implementation: Llama 2 and Mixtral (English models)
Toxicity Aligner (Synthetic Generation)
Generate safety alignment data
Model or implementation: Aligned Llama 2 model
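The wiki-grounded instruction generation step can be sketched as a prompt-construction function: an India-centric Wikipedia article is embedded in a prompt asking an English LLM (Llama 2 or Mixtral in the paper) to produce a grounded conversation. The prompt wording below is an assumption; the paper's exact templates may differ.

```python
def build_grounded_prompt(article_title: str, article_text: str,
                          n_turns: int = 3) -> str:
    # Hypothetical prompt template for context-grounded dialogue generation.
    return (
        f"Read the following article about {article_title}:\n\n"
        f"{article_text}\n\n"
        f"Write a {n_turns}-turn conversation between a curious user and a "
        "helpful assistant, grounded strictly in the facts above."
    )

prompt = build_grounded_prompt(
    "Kumbh Mela",
    "The Kumbh Mela is a major pilgrimage and festival in Hinduism...",
)
# `prompt` would then be sent to the English LLM; the generated
# conversations are later translated into the Indic languages.
```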
Novel Architectural Elements
Integration of human verification loop before crawling (rejecting 38-80% of URLs) to prevent garbage-in-garbage-out, unlike standard automated Common Crawl dumps
Unified Spark pipeline handling three distinct modalities (Web, PDF, Video) simultaneously for low-resource language aggregation
Comparison to Prior Work
vs. mC4/OSCAR: IndicLLMSuite employs manual human verification of sources *before* crawling, whereas mC4/OSCAR rely on post-hoc filtering of automated crawls
vs. MADLAD-400: Focuses specifically on Indic languages (22 languages) with specialized OCR and video transcription pipelines, rather than global coverage
vs. Samanantar: Extends beyond translation aggregation to include large-scale web crawling, PDF extraction, and synthetic instruction generation [not cited in paper as comparison, but Samanantar is an older dataset by same group]
Limitations
Dependency on synthetic (translated) data for a large portion of the corpus (162B out of 251B tokens), which may introduce translation artifacts.
Reliance on English LLMs (Llama 2, Mixtral) for generating instructions implies the 'Indian context' is filtered through an English-centric model's worldview.
Verification process is manual and resource-intensive, potentially limiting scalability compared to fully automated pipelines like CommonCrawl.
All datasets (Sangraha, IndicAlign), the Setu pipeline code, and the custom tokenizer are publicly released at https://github.com/AI4Bharat/IndicLLMSuite. The paper mentions utilizing a custom tokenizer with fertility 1.3-2.79. Specific model weights for the translation models used are not explicitly named in the snippet but are described as 'open-source translation model'.
📊 Experiments & Results
Evaluation Setup
Dataset Construction and Statistical Analysis
Metrics:
Token count
Instruction-response pair count
Tokenizer fertility
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Outcomes of the manual verification process for website URLs, showing the proportion of accepted vs. rejected sites.
Main Takeaways
Created 'Sangraha', the largest open-source pre-training dataset for Indic languages to date, comprising 251 Billion tokens across 22 languages.
The 'Verified' subset (64B tokens) represents a high-quality core derived from manually approved URLs, PDFs, and videos, addressing the quality issues prevalent in purely web-crawled datasets like mC4.
Synthetic data generation via translation and LLM-grounding (162B tokens) successfully scales up the available volume for low-resource languages where native digital content is scarce.
The 'IndicAlign-Instruct' dataset (74.8M pairs) bridges the gap in fine-tuning resources by adapting high-quality English instruction methodologies (like Wikihow-grounded conversations) to Indian languages.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM pre-training vs. fine-tuning
Familiarity with web crawling and data cleaning pipelines (Common Crawl, OSCAR)
Knowledge of synthetic data generation techniques (Self-Instruct, translation-based augmentation)
Key Terms
OCR: Optical Character Recognition—technology to convert images of text (like PDFs) into machine-readable text formats
Fertility: A tokenizer metric indicating the average number of tokens required to represent a word; lower fertility suggests better tokenization efficiency for that language
Spark: Apache Spark—a unified analytics engine for large-scale data processing, used here to distribute the data cleaning pipeline
IFT: Instruction Fine-Tuning—training a model on prompt-response pairs to teach it to follow user instructions
LID: Language Identification—automated tools used to classify the language of a given text segment
Synthetic Data: Data generated by other AI models (e.g., translating English data or asking an LLM to generate questions) rather than human writing
Sangraha: The name given to the pre-training dataset suite in this paper (Sanskrit for 'Collection')
Setu: The name given to the data processing pipeline in this paper (Sanskrit for 'Bridge')
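The fertility metric defined above can be computed directly. The `toy_tokenize` function below is a stand-in assumption; in practice one would call a real subword tokenizer (e.g. SentencePiece) on a held-out corpus.

```python
def toy_tokenize(text: str):
    # Hypothetical subword tokenizer: splits each word into 3-char chunks.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def fertility(corpus: list[str], tokenize=toy_tokenize) -> float:
    # Fertility = total tokens / total whitespace-delimited words.
    n_words = sum(len(line.split()) for line in corpus)
    n_tokens = sum(len(tokenize(line)) for line in corpus)
    return n_tokens / n_words
```

A fertility in the 1.3-2.79 range, as reported for the paper's custom tokenizer, means each word maps to that many subword tokens on average; lower values indicate more efficient tokenization for that language.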