Attributing Culture-Conditioned Generations to Pretraining Corpora

📝 Paper Summary

Cultural bias in LLMs, Pretraining data attribution, Memorization analysis

The MEMOed framework attributes cultural bias in LLM generations to pretraining data patterns, distinguishing between true memorization, diffuse associations, and cross-cultural generalizations driven by frequency imbalances.

Core Problem

LLMs often exhibit cultural biases in open-ended generation, favoring high-frequency cultures and producing templated or inaccurate outputs for marginalized ones, but the link to specific pretraining data patterns is unclear.

Why it matters:

Models default to Western or high-frequency cultural norms, marginalizing real-world diversity
Understanding the root cause (memorization vs. generalization) is necessary for effective mitigation (unlearning or data augmentation)
Current attribution methods often focus on specific facts rather than broader cultural association patterns

Concrete Example: When asked about food for a Japanese neighbor, a model generates 'Miso Soup' (memorized). When asked about a low-frequency culture, it might generate generic 'meat' (diffuse association) or incorrectly attribute a 'kimono' to a Korean neighbor (cross-culture generalization).

Key Novelty

MEMOed (MEMOrization from pretraining document) Framework

Classifies generated symbols into four categories: Memorized (grounded in data), Diffuse (generic high-frequency terms), Cross-culture Generalization (misattributed memorization), and Weak Association (conceptual synthesis)
Uses 'contributory document' analysis to test whether a culture and a symbol co-occur within a short token distance in pretraining documents that are mainly about that culture, which separates rote memorization from other behaviors

Evaluation Highlights

Found that 46% of food symbols and 26% of clothing symbols generated by OLMo-7B are due to direct memorization of pretraining data
Memorized associations strongly correlate with culture frequency in pretraining data, leaving low-frequency cultures relying on generic symbols
Identified 'Diffuse Association' symbols (e.g., 't-shirt') that appear in generations for >50% of cultures despite lacking specific cultural ties

Breakthrough Assessment

7/10

Provides a rigorous framework for attributing cultural generation behavior to pretraining data. While the scope is limited to one model (OLMo-7B), the taxonomy of associations (Memorized vs. Diffuse vs. Cross-culture) is a valuable analytical tool.

⚙️ Technical Details

Problem Definition

Setting: Open-ended text generation conditioned on a specific cultural identity

Inputs: A prompt specifying a culture (e.g., 'My neighbor is [culture]. At dinner, probably likes to eat...')

Outputs: Generated text containing cultural entities (symbols)

Pipeline Flow

Generation Collection (Generate text for 110 cultures)
Symbol Extraction (Extract entities using Llama-3-70b)
Pretraining Data Search (Find documents containing culture+symbol)
Relevance Filtering (Filter documents using d_TOK and d_SNR)
Classification (Categorize association type based on Contribution Score)

System Modules

Generation Collection

Generate culture-conditioned text for food and clothing topics

Model or implementation: OLMo-7B

Symbol Extraction

Extract specific cultural symbols (e.g., 'sushi') from generated sentences

Model or implementation: Llama-3-70b-instruct

MEMOed Classifier

Determine if a symbol is memorized, diffuse, cross-culture, or weak association

Model or implementation: Algorithmic Analysis (using Infinigram index)

Novel Architectural Elements

MEMOed attribution logic: Combines token distance (d_TOK) and signal-to-noise ratio (d_SNR) to strictly define 'contributory documents' for cultural memorization
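The filter described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes whitespace tokens instead of subtokens, single-token culture and symbol terms, add-one smoothing and log base 2 for d_SNR, and a hypothetical d_SNR cutoff of 0. Only the 2048-token distance bound comes from the hyperparameters listed in this summary. All function names are hypothetical.

```python
import math

def min_token_distance(tokens, culture_terms, symbol_terms):
    """d_TOK sketch: smallest index gap between a culture token and a symbol token.
    Returns None when either kind of term is missing from the document."""
    culture_pos = [i for i, t in enumerate(tokens) if t in culture_terms]
    symbol_pos = [i for i, t in enumerate(tokens) if t in symbol_terms]
    if not culture_pos or not symbol_pos:
        return None
    return min(abs(c - s) for c in culture_pos for s in symbol_pos)

def doc_snr(tokens, target_terms, other_culture_terms, eps=1.0):
    """d_SNR sketch: log ratio of target-culture mentions to other-culture mentions.
    The eps smoothing and the base-2 log are assumptions made for this example."""
    target = sum(t in target_terms for t in tokens)
    others = sum(t in other_culture_terms for t in tokens)
    return math.log2((target + eps) / (others + eps))

def is_contributory(tokens, target_terms, other_culture_terms, symbol_terms,
                    max_dist=2048, min_snr=0.0):
    """A document counts as contributory when the pair is close enough and
    the target culture dominates the document (min_snr is a hypothetical cutoff)."""
    d_tok = min_token_distance(tokens, target_terms, symbol_terms)
    if d_tok is None or d_tok > max_dist:
        return False
    return doc_snr(tokens, target_terms, other_culture_terms) > min_snr

# Toy document (invented for illustration)
doc = "japan is known for miso soup and japan also has ramen unlike korea".split()
print(min_token_distance(doc, {"japan"}, {"miso"}))
print(is_contributory(doc, {"japan"}, {"korea"}, {"miso"}))
print(is_contributory(doc, {"korea"}, {"japan"}, {"miso"}))
```

In the toy document the Japan/miso pair passes both checks, while the Korea/miso pair fails the d_SNR check because mentions of the other culture outnumber the target.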

Modeling

Base Model: OLMo-7B

Training Method: Pretraining analysis only (no new training)

Training Data:

Dolma dataset (3 trillion tokens) used for analysis of pretraining statistics

Key Hyperparameters:

max_seq_len: 2048 (used as upper bound for token distance)
temperature: 1.0 (generation)
top_p: 0.95 (generation)
top_k: 50 (generation)
max_tokens: 30 (generation)
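For reference, the sampling settings above could be collected as keyword arguments in the style of the Hugging Face transformers `generate` method. The summary does not say which library was used, so the parameter names, the mapping of max_tokens to `max_new_tokens`, and `do_sample=True` are assumptions. The culture and sample counts come from the evaluation setup.

```python
# Sampling settings from the hyperparameter list, expressed with
# Hugging Face-style argument names (an assumption, not confirmed by the source).
generation_kwargs = {
    "do_sample": True,       # assumed: sampling must be on for temperature/top_p/top_k to apply
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 50,
    "max_new_tokens": 30,    # assumed equivalent of "max_tokens: 30"
}

# Generation budget stated in the evaluation setup.
N_CULTURES = 110
SAMPLES_PER_CULTURE = 300
total_generations = N_CULTURES * SAMPLES_PER_CULTURE
print(total_generations)
```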

Compute: Not reported in the paper

Comparison to Prior Work

vs. Carlini et al. (Prompting): MEMOed focuses on attributing specific culture-symbol pairs in downstream generation rather than just extracting verbatim training data
vs. Zhang et al. (Frequency Analysis): MEMOed adds the 'contributory document' filter (distance + relevance) rather than relying solely on raw corpus frequency counts
vs. Naous et al. (Cultural Bias): MEMOed moves beyond detecting that bias exists to explaining *why* it arises, via pretraining data attribution (the paper cites this work for bias detection, not as a direct attribution baseline)

Limitations

Analysis constrained to open-source models with searchable pretraining data (OLMo-7B); cannot be easily applied to closed models like GPT-4
Focuses only on food and clothing topics; cultural bias extends to values, norms, and reasoning
Does not claim results reflect real-world cultural prevalence, only trends within the specific pretraining dataset (Dolma)

Reproducibility

Code: https://github.com/huihanlhh/CultureGenAttr

The code is publicly available. The analysis relies on OLMo-7B and Dolma (both open source), with the Infinigram API used for efficient data search.

📊 Experiments & Results

Evaluation Setup

Analyze 33,000 generations (110 cultures * 300 samples) to classify symbol sources

Benchmarks:

Culture-Conditioned Generation (Custom) (Open-ended text generation) [New]

Metrics:

Percentage of symbols classified as Memorized / Diffuse / Cross-Culture / Weak
Contribution Score (Cs)
Overshadowing Ratio
Statistical methodology: Z-score thresholding (>2.6) to identify statistically significant associations in contribution score distributions
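The z-score thresholding step can be illustrated as follows. This is a sketch under assumptions: contribution scores for one symbol are taken as given inputs (how Cs is computed is not described in this summary), the population standard deviation is used, and the toy scores and the function name are invented for the example. Only the 2.6 threshold comes from the text above.

```python
from statistics import mean, pstdev

def memorized_cultures(contribution_scores, z_threshold=2.6):
    """Return {culture: z-score} for cultures whose contribution score for a
    symbol lies more than z_threshold standard deviations above the mean."""
    values = list(contribution_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std; the choice is an assumption
    if sigma == 0:
        return {}  # all cultures score the same, so nothing stands out
    z_scores = {c: (s - mu) / sigma for c, s in contribution_scores.items()}
    return {c: z for c, z in z_scores.items() if z > z_threshold}

# Toy contribution scores for one symbol across 20 cultures (hypothetical values)
scores = {f"culture_{i}": 1.0 for i in range(19)}
scores["Japanese"] = 40.0
print(memorized_cultures(scores))
```

With only a handful of cultures no z-score can exceed 2.6, so a threshold of this size presupposes a distribution over many cultures, as in the 110-culture setup.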

Key Results

Breakdown of generation sources: a large share of generated symbols comes from memorization, and the share differs by topic.
Benchmark	Metric	Baseline	This Paper	Δ
Food Generation	Symbols memorized (%)	Not reported in the paper	46	N/A
Clothing Generation	Symbols memorized (%)	Not reported in the paper	26	N/A
Diffuse Association	Prevalence (% of cultures)	Not reported in the paper	>50	N/A

Experiment Figures

Distribution of Contribution Scores across cultures for specific symbols (e.g., 'sushi')

Main Takeaways

Memorized associations correlate strongly with a culture's frequency in pretraining data; low-frequency cultures produce zero memorized symbols.
Models resort to 'Diffuse Associations' (generic terms like 'meat' or 'shirt') when they lack specific memorized knowledge, often overshadowing specific cultural symbols.
Cross-culture generalization occurs where a symbol memorized for a high-frequency culture (e.g., Japan) is generated for a correlated culture (e.g., Korea).
Weak association generalization involves the model synthesizing broad concepts (e.g., 'robe') from memorized specific symbols (e.g., 'kimono'), showing some capability to generalize beyond rote memorization.
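The four categories in these takeaways can be read as a simple decision procedure. The sketch below is one possible reading, not the paper's algorithm: the order of the checks, the input data structures, and the function name are assumptions. The 50% diffuse cutoff follows the evaluation highlights, and the toy data is invented.

```python
def classify_symbol(culture, symbol, memorized_by, generated_by, n_cultures,
                    diffuse_fraction=0.5):
    """Assign a (culture, symbol) pair to one of the four association types.

    memorized_by: symbol -> set of cultures for which the symbol passed the memorization test
    generated_by: symbol -> set of cultures whose generations contained the symbol
    """
    memorized_for = memorized_by.get(symbol, set())
    if culture in memorized_for:
        return "memorized"
    # Symbols produced for more than half of all cultures are treated as generic.
    if len(generated_by.get(symbol, set())) / n_cultures > diffuse_fraction:
        return "diffuse association"
    # Memorized for some other culture but generated for this one.
    if memorized_for:
        return "cross-culture generalization"
    return "weak association"

# Toy inputs over 4 cultures (hypothetical)
memorized_by = {"kimono": {"Japanese"}}
generated_by = {
    "kimono": {"Japanese", "Korean"},
    "t-shirt": {"Japanese", "Korean", "Kenyan"},
    "robe": {"Japanese"},
}
for pair in [("Japanese", "kimono"), ("Korean", "kimono"),
             ("Kenyan", "t-shirt"), ("Japanese", "robe")]:
    print(pair, classify_symbol(*pair, memorized_by, generated_by, n_cultures=4))
```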

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) pretraining and dataset composition
Familiarity with the concept of memorization in LLMs
Basic knowledge of tokenization and n-grams

Key Terms

symbol: An entity (e.g., 'kimono', 'pizza') mentioned in a culture-conditioned generation

MEMOed: MEMOrization from pretraining document—the proposed framework to classify if a generated symbol results from memorizing training data

d_TOK: Minimum token distance—a metric measuring the number of subtokens between a culture term and a symbol term in a document

d_SNR: Document-Signal to Noise Ratio—log ratio of the frequency of the target culture to the sum of all other cultures in a document

Dolma: The open-sourced pretraining dataset used to train the OLMo model

Infinigram: An engine/API used to index and search n-gram frequencies in large datasets like Dolma

OLMo-7B: Open Language Model—a fully open-source LLM with accessible pretraining data and code

LDA: Latent Dirichlet Allocation—a generative statistical model used for topic modeling to find common themes in documents
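The d_TOK and d_SNR entries above can be written out as formulas. This is a sketch of the verbal definitions only. The notation is assumed for illustration: P_C(d) and P_S(d) are the token positions of culture C and symbol S in document d, n_C(d) is the number of mentions of culture C in d, and G is the set of all cultures. The log base, the handling of zero counts, and whether the distance counts the endpoints are not specified in this summary.

```latex
d_{\mathrm{TOK}}(C, S; d) = \min_{i \in P_C(d),\; j \in P_S(d)} \lvert i - j \rvert

d_{\mathrm{SNR}}(C; d) = \log \frac{n_C(d)}{\sum_{C' \in G \setminus \{C\}} n_{C'}(d)}
```

For instance, if base 2 is assumed, a document that mentions the target culture 4 times and all other cultures once in total has d_SNR = log2(4/1) = 2.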