On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

📝 Paper Summary

Cultural Bias in LLMs, Multilingual Evaluation

Cultural bias against non-Western entities in LMs is driven not just by data imbalance, but by linguistic factors like polysemy and tokenization in the non-Western language.

Core Problem

Language models show strong favoritism toward Western entities when operating in non-Western languages (specifically Arabic), failing to recognize local cultural entities even when they appear frequently in pre-training data.

Why it matters:

Current multilingual LMs struggle to adapt to local cultural contexts, limiting their utility for global communities
Prior work attributes bias mainly to data frequency, overlooking how specific linguistic features (like word senses) in non-English languages exacerbate the problem
Biased performance hinders downstream tasks like Named Entity Recognition and Question Answering in non-Western languages

Concrete Example: When asked to extract the Arab dish 'Makloube' from Arabic text, an LM fails because 'Makloube' also means 'flipped' (adjective), but it successfully extracts the same dish from the English translation. Conversely, it easily extracts 'Lasagna' in both languages because 'Lasagna' has only one sense.

Key Novelty

Linguistic Roots of Cultural Bias Analysis

Introduces CAMeL-2, a parallel Arabic-English benchmark to isolate language effects from cultural knowledge gaps
Identifies that high-frequency Arab entities often suffer performance drops due to polysemy (having multiple meanings in Arabic), unlike Western entities, which usually appear in Arabic as transliterated, monosemous nouns
Demonstrates that frequency-based tokenization worsens bias by merging polysemous Arab entities into single tokens that the model conflates with their non-entity meanings

Architecture

Conceptual illustration of the problem: An LM fails to extract 'Makloube' (Arab food) in Arabic because the word is polysemous, but succeeds in English. It succeeds with 'Lasagna' (Western food) in both languages.

Evaluation Highlights

Llama-3.3-70b shows a 27-point F1 gap between Western and Arab entities in Arabic NER, compared to a much smaller gap in English
QA accuracy on Arab locations drops to ~40-60% for countries with high polysemy rates, while remaining near 90% for Western locations in the same language
Entities tokenized into a single token perform worse than multi-token entities, especially when the single token is a polysemous Arabic word

Breakthrough Assessment

8/10

Significant shift in perspective: moves beyond 'add more data' to showing how linguistic structures (polysemy, script sharing) fundamentally confuse models about cultural entities.

⚙️ Technical Details

Problem Definition

Setting: Cross-lingual and cross-cultural entity extraction and filling

Inputs: Culturally grounded contexts with masked entities (e.g., 'Today's lunch is Arab, I've cooked [MASK]') or QA prompts

Outputs: The specific cultural entity filling the mask or answering the question
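The masked-context input format can be sketched as follows. This is a minimal illustration, not the paper's code: `fill_mask`, `rank_candidates`, and `toy_score` are hypothetical names, and `toy_score` is an arbitrary placeholder for a real LM likelihood, so the ranking it produces carries no meaning.

```python
from typing import Callable, List

MASK = "[MASK]"


def fill_mask(context: str, entity: str) -> str:
    """Replace the single [MASK] placeholder with a candidate entity."""
    if context.count(MASK) != 1:
        raise ValueError("context must contain exactly one [MASK] placeholder")
    return context.replace(MASK, entity)


def rank_candidates(
    context: str, candidates: List[str], score: Callable[[str], float]
) -> List[str]:
    """Order candidate entities by the score of the filled-in sentence."""
    return sorted(
        candidates, key=lambda e: score(fill_mask(context, e)), reverse=True
    )


def toy_score(sentence: str) -> float:
    # ASSUMPTION: placeholder scorer (shorter sentence = higher score).
    # A real evaluation would use the LM's log-likelihood here.
    return -float(len(sentence))


context = "Today's lunch is Arab, I've cooked [MASK]"
ranking = rank_candidates(context, ["Makloube", "Lasagna"], toy_score)
print(ranking)
```

Keeping the scorer as a plain callable lets the same context be scored by different models, or in both languages, without changing the prompt-building code.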

Pipeline Flow

Entity Extraction & Annotation (Wikipedia/Wikidata/OSM)
Context Generation (Web/Twitter mining)
Translation (Manual Arabic-to-English)
Evaluation (NER, QA, Text Infilling)

System Modules

CAMeL-2 Construction

Create parallel Arabic-English dataset of cultural entities and contexts

Model or implementation: Human annotators + Scripts

Evaluator

Test LMs on extracting/filling entities in both languages

Model or implementation: Various LMs (Llama-3, XLM-R, etc.)

Novel Architectural Elements

Parallel benchmarking framework: Designed specifically to decouple language (Arabic/English) from content (Cultural Entity) to isolate linguistic bias factors

Modeling

Base Model: Evaluates multiple models: Llama-3.3, Aya-23, Qwen-2.5, AceGPT, JAIS, XLM-R, ARBERT, MARBERT, CAMeLBERT, AraBERT

Training Method: Fine-tuning (for encoder-only, BERT-style models) and few-shot/zero-shot prompting (for decoder-only, GPT-style models)

Adaptation: Fine-tuning for NER (BERT-style models); prompting for QA (GPT-style models)

Trainable Parameters: Full model for BERT-style fine-tuning; frozen for GPT-style prompting

Training Data:

ANERCorp (Arabic NER)
CoNLL-2003 (English NER)
Distantly supervised Wikipedia data for uncovered entity types

Compute: Not reported in the paper

Comparison to Prior Work

vs. CAMeL (v1): CAMeL-2 is parallel (Arabic-English), 3x larger (58k entities), and includes longer QA contexts
vs. Standard Bias Benchmarks: Focuses on linguistic drivers (polysemy, tokenization) rather than just data representation imbalance

Limitations

Analysis is specific to Arabic-English; generalization to other non-Western languages not tested
Relies on mC4 to approximate pre-training frequency, which may not match proprietary model training data exactly
Manual annotation of 'culture' can be subjective, though adjudication was used

Reproducibility

Code: https://github.com/tareknaous/camel2

Dataset code is available. The study relies on public models (Llama-3, XLM-R) and public datasets (mC4 for frequency analysis). Exact fine-tuning hyperparameters for the NER models are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Parallel evaluation in Arabic and English on Named Entity Recognition (NER), Extractive QA, and Masked Token Prediction (Text Infilling)

Benchmarks:

CAMeL-2 (Cultural Entity Recognition / QA) [New]

Metrics:

F1 Score (NER)
Exact Match Accuracy (Extractive QA)
Cultural Bias Score (CBS - probability preference for Western vs Arab entities)
Statistical methodology: Cohen's Kappa reported for annotator agreement (0.825). Significance tests for model results not explicitly reported in the paper.
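One way to read the CBS metric is as a pairwise win rate: for a given context, the percentage of (Arab entity, Western entity) pairs in which the model assigns the higher probability to the Western entity. The sketch below assumes that reading; the exact formula in the paper may differ, and `cultural_bias_score` and the log-probabilities are hypothetical.

```python
from itertools import product
from typing import List


def cultural_bias_score(
    arab_logprobs: List[float], western_logprobs: List[float]
) -> float:
    """Percentage of Arab/Western entity pairs where the Western entity
    receives a strictly higher log-probability (50 = no preference)."""
    pairs = list(product(arab_logprobs, western_logprobs))
    if not pairs:
        raise ValueError("need at least one entity from each culture")
    western_wins = sum(w > a for a, w in pairs)
    return 100.0 * western_wins / len(pairs)


# ASSUMPTION: invented log-probabilities for one context.
arab = [-2.0, -5.0]
western = [-1.0, -3.0, -6.0]
cbs = cultural_bias_score(arab, western)
print(cbs)
```

Under this reading, a score above 50 means the model prefers Western entities in that context, and averaging over contexts gives a corpus-level score.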

Key Results

NER and QA results show a much larger performance gap between cultures when models operate in Arabic compared to English.
Benchmark	Metric	Baseline	This Paper	Δ
CAMeL-2 (Arabic)	Difference in F1 (Western - Arab)	0	27	+27
CAMeL-2 (Arabic)	Difference in Accuracy (Western - Arab)	0	15	+15
Text infilling (CBS) results show models are more biased (prefer Western entities) in Arabic contexts than in English contexts.
Benchmark	Metric	Baseline	This Paper	Δ
CAMeL-2 (Locations)	Cultural Bias Score (CBS)	28	50	+22
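The "Difference in F1 (Western - Arab)" rows can be read as a per-culture F1 computed separately and then subtracted. The sketch below assumes entity-level F1 over exact span matches; the span representation, function names, and toy predictions are illustrative and not taken from the paper.

```python
from typing import Set, Tuple

# ASSUMPTION: a span is identified by (sentence_id, start_token, end_token).
Span = Tuple[int, int, int]


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1 over exact span matches."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def culture_gap(western_f1: float, arab_f1: float) -> float:
    """Western minus Arab F1, in points (0 = no gap)."""
    return 100.0 * (western_f1 - arab_f1)


# Toy data: both Western entities found, one of two Arab entities missed.
gold_western = {(0, 0, 1), (1, 2, 3)}
pred_western = {(0, 0, 1), (1, 2, 3)}
gold_arab = {(2, 0, 1), (3, 4, 5)}
pred_arab = {(2, 0, 1)}

gap = culture_gap(
    span_f1(gold_western, pred_western), span_f1(gold_arab, pred_arab)
)
print(round(gap, 2))
```

Running the same computation on the Arabic and English sides of a parallel benchmark gives one gap per language, which is the comparison the table summarizes.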

Experiment Figures

QA Accuracy vs. Entity Frequency in Pre-training Data (mC4)

QA Accuracy per Arab country vs. Percentage of Polysemous Entities

Performance distribution based on token count per entity

Main Takeaways

LMs exhibit 'Western Bias' more severely in Arabic than in English; the gap shrinks significantly when the same entities are queried in English.
High-frequency Arab entities suffer from a 'Polysemy Penalty': entities that double as common words (e.g., 'Makloube', which also means 'flipped') are harder to recognize than entities whose names have no other meaning.
Script sharing hurts performance: Arab entities whose written forms also appear frequently in Farsi or Urdu text (often with different meanings) confuse the model.
Tokenization is a root cause: large Arabic vocabularies lead to polysemous entities being tokenized as single tokens, merging their 'entity' and 'common word' representations.
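The tokenization takeaway suggests a simple analysis: bucket entities by whether they are a single token and whether they are polysemous, then compare accuracy per bucket. The sketch below shows only the bookkeeping; the `EntityResult` fields and the toy records are invented for illustration and do not reproduce the paper's numbers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EntityResult:
    n_tokens: int      # tokens the tokenizer produced for the entity
    polysemous: bool   # entity string also has a non-entity meaning
    correct: bool      # model recognized/answered the entity correctly


def accuracy_by_bucket(
    results: List[EntityResult],
) -> Dict[Tuple[str, bool], float]:
    """Accuracy per (token bucket, polysemous) group."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for r in results:
        bucket = "single-token" if r.n_tokens == 1 else "multi-token"
        counts[(bucket, r.polysemous)][0] += int(r.correct)
        counts[(bucket, r.polysemous)][1] += 1
    return {key: c / n for key, (c, n) in counts.items()}


# ASSUMPTION: toy records, not real evaluation output.
results = [
    EntityResult(1, True, False),
    EntityResult(1, True, True),
    EntityResult(1, False, True),
    EntityResult(3, False, True),
    EntityResult(2, False, True),
]
table = accuracy_by_bucket(results)
for key, acc in sorted(table.items()):
    print(key, acc)
```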

📚 Prerequisite Knowledge

Prerequisites

Understanding of tokenization (BPE, WordPiece)
Familiarity with Named Entity Recognition (NER) and Extractive QA
Basic knowledge of multilingual language models

Key Terms

Polysemy: The capacity for a word or phrase to have multiple meanings (e.g., 'Makloube' as a dish vs. 'flipped')

Transliteration: Representing a word from one language using the script of another (e.g., writing 'Lasagna' in Arabic script)

CBS (Cultural Bias Score): A metric measuring an LM's likelihood preference for Western over Arab entities in a neutral or Arab-specific context

mC4: Multilingual Colossal Clean Crawled Corpus, a large web-crawled dataset used for pre-training multilingual language models

Script sharing: When multiple languages (e.g., Arabic, Farsi, Urdu) use the same writing system, causing lexical overlap

NER: Named Entity Recognition, the task of locating named entities in text and classifying them into categories (people, places, organizations)

Extractive QA: Question Answering where the model must extract the answer as a span of text from the provided context

Text-infilling: A task where the model predicts missing words (masked tokens) in a sentence

Frequency-based tokenization: Algorithms like BPE that assign unique tokens to frequent character sequences; can merge polysemous words into single tokens
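To make the frequency-based tokenization entry concrete, here is a toy byte-pair-encoding trainer. It is a simplified sketch, not any production tokenizer: the romanized strings stand in for Arabic-script words, the frequencies are invented, and real BPE implementations add details (byte fallback, word boundaries, pre-tokenization) omitted here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def apply_merge(symbols: List[str], pair: Pair) -> List[str]:
    """Merge every adjacent occurrence of `pair` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Pair]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    words = {w: list(w) for w in word_freqs}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for w, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[w]
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {w: apply_merge(s, best) for w, s in words.items()}
    return merges


def tokenize(word: str, merges: List[Pair]) -> List[str]:
    """Apply learned merges, in training order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# ASSUMPTION: invented counts. The frequent word is frequent partly because
# it is also a common word; the rare one is a transliterated name.
merges = train_bpe({"makloube": 50, "lasagna": 1}, num_merges=7)
print(tokenize("makloube", merges))
print(tokenize("lasagna", merges))
```

In this toy run the frequent string collapses into a single token while the rare one stays split into characters. A single-token polysemous word has one embedding shared by its entity and non-entity senses, which is the conflation the summary describes.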