Mohammed Safi Ur Rahman Khan, Priyam Mehta, A. Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, G. Suriyaprasaad, G. VarunBalan, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
Indian Institute of Technology, Madras,
Microsoft,
National Institute of Information and Communications Technology
Annual Meeting of the Association for Computational Linguistics
(2024)
Pretraining · Benchmark · Speech · MM
📝 Paper Summary
Low-resource NLP · Multilingual Data Curation
IndicLLMSuite addresses the scarcity of Indian language resources by releasing 251 billion tokens of pre-training data and 74 million instruction pairs created via a novel pipeline combining human verification and synthetic generation.
Core Problem
Low- and mid-resource languages lack tailored data resources for LLM development: standard open-source pipelines fail to effectively curate diverse sources (websites, PDFs, videos), and no sufficiently capable existing models are available to generate high-quality synthetic instructions.
Why it matters:
Languages spoken by over 1.4 billion people (Indian sub-continent) are minimally represented in current open-source LLMs (e.g., Llama 2, Mistral), excluding rich cultural contexts.
The 'chicken and egg' problem persists: the lack of high-quality Indic LLMs prevents generating the synthetic data (as in Self-Instruct) needed to train better ones.
Existing web-crawled corpora like Common Crawl contain noise, mislabeled languages, and defunct URLs, so lower-resource languages require dedicated curation.
Concrete Example: Standard language identification tools often mislabel closely related Indic languages (e.g., Hindi and Marathi). Furthermore, many URLs in existing datasets like mC4 are now defunct or contain machine-translated content that degrades model quality, necessitating the manual verification strategy employed in this work.
Combines manually verified sources (Sangraha Verified) with large-scale synthetic translation of English content to balance high quality with the volume required for LLM training.
Utilizes a custom Spark-based distributed pipeline (Setu) specifically designed to handle diverse Indic sources including OCR for PDFs and transcription for videos, unlike text-only web crawlers.
Generates culture-grounded instruction data by prompting English LLMs (Llama 2, Mixtral) with India-centric Wikipedia articles, creating synthetic conversations rooted in local context.
Architecture
Overview of the IndicLLMSuite contributions, categorized into Pre-training (Sangraha), Pipeline (Setu), and Fine-tuning (IndicAlign) components.
Evaluation Highlights
Release of Sangraha: A massive pre-training corpus containing 251 billion tokens across 22 Indian languages.
Release of IndicAlign-Instruct: A collection of 74.8 million instruction-response pairs created via translation, aggregation, and synthetic generation.
Release of IndicAlign-Toxic: 123,000 toxic prompt and non-toxic response pairs for safety alignment.
Breakthrough Assessment
8/10
A foundational resource contribution that significantly lowers the barrier for building LLMs in Indian languages. While it doesn't propose a new model architecture, the scale (251B tokens) and rigorous curation pipeline are major advancements for the field.
⚙️ Technical Details
Problem Definition
Setting: Creation of large-scale pre-training and fine-tuning datasets for 22 scheduled Indian languages
Inputs: Diverse raw data sources: URLs, PDFs, Videos, existing English/Multilingual corpora
Outputs: Cleaned, deduplicated, and tokenized text datasets (Sangraha) and instruction pairs (IndicAlign)
Pipeline Flow
Source Discovery: URL curation via search/news repositories -> Human Verification
Extraction: Setu Pipeline (Crawling / PDF OCR / Video Transcription)
Processing: Cleaning -> Language Identification -> Toxicity Filtering -> Deduplication
Synthetic Generation: Translation of English datasets -> Wiki-grounded conversation generation
Compilation: Aggregation into Sangraha (Pre-training) and IndicAlign (Instruction)
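The processing stages above can be sketched in miniature. This is an illustrative toy, not the actual Setu implementation: the real pipeline runs on Apache Spark with dedicated LID and toxicity models, and all function names, the Devanagari-range LID heuristic, and the blocklist below are assumptions for demonstration.

```python
import hashlib

def clean(doc: str) -> str:
    # Normalize whitespace; real cleaning also strips HTML remnants, etc.
    return " ".join(doc.split())

def identify_language(doc: str) -> str:
    # Placeholder LID: a real system needs a classifier robust to
    # closely related Indic languages (e.g. Hindi vs. Marathi).
    return "hi" if any("\u0900" <= ch <= "\u097F" for ch in doc) else "en"

def is_toxic(doc: str, blocklist=("badword",)) -> bool:
    # Placeholder toxicity filter via substring blocklist.
    return any(w in doc.lower() for w in blocklist)

def process(docs, keep_langs={"hi"}):
    seen, out = set(), []
    for doc in docs:
        doc = clean(doc)
        if identify_language(doc) not in keep_langs:
            continue
        if is_toxic(doc):
            continue
        h = hashlib.sha256(doc.encode()).hexdigest()  # exact dedup by hash
        if h in seen:
            continue
        seen.add(h)
        out.append(doc)
    return out
```

Exact-hash deduplication is used here for brevity; large-scale pipelines typically also apply fuzzy (e.g. MinHash-based) deduplication.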
System Modules
Setu Pipeline
Distributed extraction and cleaning of text from heterogeneous sources
Model or implementation: Apache Spark-based custom pipeline
Translation Engine (Synthetic Generation)
Translate high-quality English datasets into Indian languages
Model or implementation: Open-source MT models (exact model not specified in snippet, likely NLLB or IndicTrans2 based on authors' prior work)
Instruction Generator (Synthetic Generation)
Generate context-grounded conversations from encyclopedia articles
Model or implementation: Llama 2 and Mixtral (English models)
Toxicity Aligner (Synthetic Generation)
Generate safety alignment data
Model or implementation: Aligned Llama 2 model
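The wiki-grounded instruction generation step can be sketched as a prompt-construction function: an India-centric Wikipedia article is embedded in a prompt asking an English LLM (Llama 2 or Mixtral in the paper) to produce a grounded conversation. The prompt wording below is an assumption; the paper's exact templates may differ.

```python
def build_grounded_prompt(article_title: str, article_text: str,
                          n_turns: int = 3) -> str:
    # Hypothetical prompt template for context-grounded dialogue generation.
    return (
        f"Read the following article about {article_title}:\n\n"
        f"{article_text}\n\n"
        f"Write a {n_turns}-turn conversation between a curious user and a "
        "helpful assistant, grounded strictly in the facts above."
    )

prompt = build_grounded_prompt(
    "Kumbh Mela",
    "The Kumbh Mela is a major pilgrimage and festival in Hinduism...",
)
# `prompt` would then be sent to the English LLM; the generated
# conversations are later translated into the Indic languages.
```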
Novel Architectural Elements
Integration of human verification loop before crawling (rejecting 38-80% of URLs) to prevent garbage-in-garbage-out, unlike standard automated Common Crawl dumps
Unified Spark pipeline handling three distinct modalities (Web, PDF, Video) simultaneously for low-resource language aggregation
Comparison to Prior Work
vs. mC4/OSCAR: IndicLLMSuite employs manual human verification of sources *before* crawling, whereas mC4/OSCAR rely on post-hoc filtering of automated crawls
vs. MADLAD-400: Focuses specifically on Indic languages (22 languages) with specialized OCR and video transcription pipelines, rather than global coverage
vs. Samanantar: Extends beyond translation aggregation to include large-scale web crawling, PDF extraction, and synthetic instruction generation [not cited in paper as comparison, but Samanantar is an older dataset by same group]
Limitations
Dependency on synthetic (translated) data for a large portion of the corpus (162B out of 251B tokens), which may introduce translation artifacts.
Reliance on English LLMs (Llama 2, Mixtral) for generating instructions implies the 'Indian context' is filtered through an English-centric model's worldview.
Verification process is manual and resource-intensive, potentially limiting scalability compared to fully automated pipelines like CommonCrawl.
All datasets (Sangraha, IndicAlign), the Setu pipeline code, and the custom tokenizer are publicly released at https://github.com/AI4Bharat/IndicLLMSuite. The paper mentions utilizing a custom tokenizer with fertility 1.3-2.79. Specific model weights for the translation models used are not explicitly named in the snippet but are described as 'open-source translation model'.
📊 Experiments & Results
Evaluation Setup
Dataset Construction and Statistical Analysis
Metrics:
Token count
Instruction-response pair count
Tokenizer fertility
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Outcomes of the manual verification process for website URLs, showing the proportion of accepted vs. rejected sites.
Main Takeaways
Created 'Sangraha', the largest open-source pre-training dataset for Indic languages to date, comprising 251 Billion tokens across 22 languages.
The 'Verified' subset (64B tokens) represents a high-quality core derived from manually approved URLs, PDFs, and videos, addressing the quality issues prevalent in purely web-crawled datasets like mC4.
Synthetic data generation via translation and LLM-grounding (162B tokens) successfully scales up the available volume for low-resource languages where native digital content is scarce.
The 'IndicAlign-Instruct' dataset (74.8M pairs) bridges the gap in fine-tuning resources by adapting high-quality English instruction methodologies (like Wikihow-grounded conversations) to Indian languages.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM pre-training vs. fine-tuning
Familiarity with web crawling and data cleaning pipelines (Common Crawl, OSCAR)
Knowledge of synthetic data generation techniques (Self-Instruct, translation-based augmentation)
Key Terms
OCR: Optical Character Recognition—technology to convert images of text (like PDFs) into machine-readable text formats
Fertility: A tokenizer metric indicating the average number of tokens required to represent a word; lower fertility suggests better tokenization efficiency for that language
Spark: Apache Spark—a unified analytics engine for large-scale data processing, used here to distribute the data cleaning pipeline
IFT: Instruction Fine-Tuning—training a model on prompt-response pairs to teach it to follow user instructions
LID: Language Identification—automated tools used to classify the language of a given text segment
Synthetic Data: Data generated by other AI models (e.g., translating English data or asking an LLM to generate questions) rather than human writing
Sangraha: The name given to the pre-training dataset suite in this paper (Sanskrit for 'Collection')
Setu: The name given to the data processing pipeline in this paper (Sanskrit for 'Bridge')
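The fertility metric defined above can be computed directly. The `toy_tokenize` function below is a stand-in assumption; in practice one would call a real subword tokenizer (e.g. SentencePiece) on a held-out corpus.

```python
def toy_tokenize(text: str):
    # Hypothetical subword tokenizer: splits each word into 3-char chunks.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def fertility(corpus: list[str], tokenize=toy_tokenize) -> float:
    # Fertility = total tokens / total whitespace-delimited words.
    n_words = sum(len(line.split()) for line in corpus)
    n_tokens = sum(len(tokenize(line)) for line in corpus)
    return n_tokens / n_words
```

A fertility in the 1.3-2.79 range, as reported for the paper's custom tokenizer, means each word maps to that many subword tokens on average; lower values indicate more efficient tokenization for that language.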