The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

📝 Paper Summary

Multilingual Language Models, Open Source Foundation Models

Lucie-7B is an open foundation model trained on a massive multilingual dataset with equal French and English representation, prioritizing data rights and transparency.

Core Problem

Mainstream LLMs such as Llama are heavily English-centric, which degrades their cultural and linguistic performance in other languages, and they are often trained on copyrighted data with little transparency about its sources.

Why it matters:

English-centric bias results in models that lack knowledge of specific cultural history, social practices, and everyday activities of non-English communities (e.g., French cooking or history).
Dependence on opaque datasets raises legal and ethical issues regarding copyright, intellectual property, and personally identifying information.

Concrete Example: When asked about history or cooking, an English-centric model is likely to provide answers suited to Anglophone culture rather than French culture, leading to unsatisfying results for French speakers.

Key Novelty

OpenLLM-France Lucie Project

Constructs a training dataset with a 33% French / 33% English split to explicitly offset Anglo-centric bias.
Prioritizes data rights by minimizing copyrighted material and maximizing public domain sources (e.g., Gallica, arguably the largest pre-processed French text collection).
Achieves OSI (Open Source Initiative) compliance by releasing weights, code, and the full training dataset breakdown.

Architecture

Charts showing the distribution of the Lucie Training Dataset by source category and language.

Evaluation Highlights

Lucie-7B-Instruct-v1.1 achieves promising results compared to state-of-the-art models (specific numbers not detailed in the provided text excerpt).
The Lucie Training Dataset is one of the biggest collections of French text data preprocessed for LLM training (40% of the raw data is French).
Foundation model trained on roughly 33% English and 33% French data to balance cultural representation.

Breakthrough Assessment

7/10

Significant contribution to open science and non-English (French) NLP through transparent data curation and OSI compliance, though the model architecture itself is standard.

⚙️ Technical Details

Problem Definition

Setting: Causal language modeling (predicting the next token) on a multilingual corpus

Inputs: Text sequences in English, French, Spanish, German, Italian, or code

Outputs: Next token probabilities / generated text
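
To make the setting concrete, here is a minimal sketch of the next-token objective in plain Python. It is an illustration under stated assumptions, not the Lucie training code: the three-token vocabulary, the hand-written logits, and the function names are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average negative log-likelihood of each token given its prefix.

    logits_per_position[t] holds the scores produced after reading
    token_ids[:t + 1]; the target at that position is token_ids[t + 1].
    """
    targets = token_ids[1:]
    if not targets:
        raise ValueError("need at least two tokens")
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example (hypothetical): vocabulary of 3 tokens, sequence of 3 tokens.
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = next_token_loss(uniform_logits, tokens)  # equals log(3) for uniform scores
```

With uniform scores the loss equals the log of the vocabulary size; a model that puts most of its probability mass on the correct next token drives the loss toward zero.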

Pipeline Flow

Data Collection & Curation (Web, Archives, Code)
Preprocessing (OCR filtering, Deduplication, PII removal)
Tokenizer Training
Pretraining (3 stages: Initial, Context Extension, Final)
Instruction Fine-tuning
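
The preprocessing step can be sketched roughly as follows. This is a simplified stand-in rather than the project's pipeline: the character-ratio OCR heuristic, the 0.8 threshold, the email-only PII pattern, and the exact-match deduplication are assumptions chosen to keep the example short.

```python
import re

# Assumption: only email addresses are masked here; real PII removal covers more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def ocr_quality(text):
    """Hypothetical heuristic: share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def mask_emails(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def preprocess(docs, min_quality=0.8):
    """Drop noisy documents, drop exact duplicates, then mask emails."""
    seen = set()
    kept = []
    for doc in docs:
        if ocr_quality(doc) < min_quality:
            continue  # likely OCR noise
        key = " ".join(doc.lower().split())  # case- and whitespace-insensitive key
        if key in seen:
            continue  # duplicate of a document already kept
        seen.add(key)
        kept.append(mask_emails(doc))
    return kept


docs = [
    "Le chat dort sur le canapé.",
    "le  chat dort sur le canapé.",
    "L3 ch@t d0rt §§ 1e c4n4p# ~~ %%",
    "Contact: marie@example.org pour plus de détails.",
]
cleaned = preprocess(docs)
```

Of the four sample documents, the second is dropped as a duplicate of the first and the third as OCR noise; the email in the fourth is masked.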

System Modules

Data Curator

Aggregates data from diverse sources (RedPajama, Gallica, Wikipedia) and balances languages (33% Fr, 33% En)

Model or implementation: Scripts/Pipelines (Custom)
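
One way to reach a target language mix is to repeat or subsample each language's raw data. The sketch below shows that arithmetic only; the raw token counts are hypothetical, the 34% share for the remaining languages and code is simply what is left after the two 33% shares, and the summary does not say how the authors actually implemented the balancing.

```python
def sampling_multipliers(raw_tokens, target_share):
    """Per language, the factor by which raw data is repeated (> 1) or
    subsampled (< 1) so that the mix matches target_share while the total
    number of training tokens stays equal to the raw total."""
    if abs(sum(target_share.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")
    total = sum(raw_tokens.values())
    return {
        lang: target_share[lang] * total / raw_tokens[lang]
        for lang in raw_tokens
    }


# Hypothetical raw token counts (arbitrary units) with French at 40% of the raw data.
raw_tokens = {"fr": 400, "en": 300, "other": 300}
# 33% French and 33% English as in the summary; the rest is the remainder.
target_share = {"fr": 0.33, "en": 0.33, "other": 0.34}

multipliers = sampling_multipliers(raw_tokens, target_share)
```

In this toy setting French is slightly subsampled (factor below 1) while English and the remaining sources are slightly oversampled.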

Lucie-7B Foundation

Predict next token across multilingual contexts

Model or implementation: Transformer-based Causal LLM (7B parameters)

Instruction Tuner

Adapt foundation model to follow instructions

Model or implementation: Lucie-7B-Instruct variants
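
A common convention in supervised fine-tuning is to compute the loss only on the response tokens. The sketch below illustrates that convention; whether the Lucie-7B-Instruct variants mask prompt tokens this way is an assumption, and the token ids are hypothetical placeholders for real tokenizer output.

```python
# -100 is the default ignore_index of PyTorch's CrossEntropyLoss; using it here
# is a convention borrowed from that library, not something stated in the summary.
IGNORE_INDEX = -100


def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; supervise only response tokens and EOS."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return input_ids, labels


# Hypothetical token ids standing in for an instruction and its answer.
prompt = [11, 12, 13]
response = [21, 22]
input_ids, labels = build_sft_example(prompt, response, eos_id=2)

# Number of positions that contribute to the loss (response tokens plus EOS).
supervised = sum(label != IGNORE_INDEX for label in labels)
```

Masking the prompt keeps the model from being trained to reproduce the instruction itself, so the gradient comes only from the answer it is expected to generate.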

Novel Architectural Elements

Explicit 33/33 French/English data ratio design in pretraining composition to enforce cultural alignment

Modeling

Base Model: Lucie-7B

Training Method: Instruction Fine-Tuning (SFT)

Training Data:

Lucie Training Dataset: 33% French, 33% English, plus Spanish, German, Italian, Code
Sources include RedPajama, FineWebEdu, Gallica (public domain books), Wikipedia, Europarl
Dataset size: about 5.75 billion tokens are reported for one small subset (Persée); the full size is not stated in the excerpt, but the comparison to Llama suggests it is on the order of trillions of tokens

Key Hyperparameters:

model_parameters: 7 Billion

Compute: Not reported in the paper

Comparison to Prior Work

vs. Llama: Lucie prioritizes 1:1 French-English balance vs. Llama's heavy English bias
vs. CroissantLLM: Lucie is larger (7B vs 1.3B) and includes broader European language support (DE, ES, IT) [not cited in paper]
vs. ChatGPT: Lucie is a smaller, open foundation model without RLHF/DPO, whereas ChatGPT is massive and closed

Limitations

Does not include alignment via RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization)
Significantly smaller than proprietary state-of-the-art models (e.g., GPT-4)
Relaxed prohibition on web data (needed for scale) despite initial goal of using only curated sources

Reproducibility

Code: https://github.com/OpenLLM-France/Lucie-Training

Highly reproducible. Publicly available: Model weights (Lucie-7B, Instruct variants), intermediate checkpoints, full training dataset (Lucie Training Dataset on Hugging Face), and data preparation/training code (GitHub). One small corpus (0.2%) is not distributed due to copyright but metadata is provided.

📊 Experiments & Results

Evaluation Setup

Evaluation of foundation and instruct models on standard benchmarks

Benchmarks:

General benchmarks: various NLP tasks (implied; specific benchmark names are not listed in the text)

Metrics:

Perplexity (for data filtering)
OCR quality scores
Statistical methodology: Not explicitly reported in the paper
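
Perplexity-based filtering can be sketched as follows. The summary does not state which scoring model or threshold was used, so the per-token log-probabilities and the cutoff of 50 below are hypothetical inputs.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_document(token_log_probs, max_perplexity):
    """Keep a document only if the scoring model finds it predictable enough."""
    return perplexity(token_log_probs) <= max_perplexity


# Hypothetical scores: a fluent text (each token predicted with p = 0.5)
# and a noisy one (each token predicted with p = 0.01).
fluent = [math.log(0.5)] * 4
noisy = [math.log(0.01)] * 4

kept = [keep_document(doc, max_perplexity=50.0) for doc in (fluent, noisy)]
```

The fluent document has perplexity 2 and is kept; the noisy one has perplexity 100 and is filtered out under this hypothetical cutoff.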

Key Results

The paper focuses on dataset construction and release. Quantitative evaluation metrics for the model's downstream performance are mentioned as 'promising' but specific numbers are not present in the provided text excerpt.

Main Takeaways

Constructed a massive, open, multilingual dataset with a specific focus on French cultural heritage (Gallica, HAL, Theses) to counter Anglo-centric bias.
Demonstrated that strict data rights (prioritizing public domain/open licenses) can be maintained while building foundation models, though some web data relaxation was necessary for scale.
Released one of the first models that truly comply with the OSI definition of open source AI, by open-sourcing data, code, and weights simultaneously.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) pretraining pipelines
Familiarity with data curation (deduplication, filtering, OCR)
Knowledge of tokenization and vocabulary size

Key Terms

LLM: Large Language Model—a deep learning model trained on vast amounts of text to generate human-like language

OCR: Optical Character Recognition—technology used to convert images of text (like scanned books) into machine-readable text formats

OSI: Open Source Initiative—an organization that defines standards for what constitutes 'open source' software and AI

SFT: Supervised Fine-Tuning—training a model on labeled examples (instructions and answers) to teach it how to follow user commands

RLHF: Reinforcement Learning from Human Feedback, a method to align models with human preferences using reward signals

DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without a separate reward model

perplexity: A measurement of how well a probability model predicts a sample; lower perplexity indicates the model is less 'surprised' by the text

causal language model: A model trained to predict the next token in a sequence based only on previous tokens

foundation model: A large-scale model trained on broad data that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks

CCNet: A pipeline for extracting high-quality monolingual datasets from web crawl data, often used to filter low-quality text

MinHash: A technique used for estimating the similarity between two sets, commonly used for deduplicating large datasets
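
As an illustration of the MinHash entry above, the following sketch estimates the Jaccard similarity of two texts from their word-shingle sets. The shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are choices made for this example, not details taken from the Lucie pipeline.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping n-word sequences from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
similarity = estimated_jaccard(sig_a, sig_b)
```

The two sample texts share 8 of 14 distinct shingles, so the estimate should land near 0.57; a deduplication pass would treat pairs above some chosen threshold as near-duplicates.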