
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia

LCL Gamboa, M Lee
School of Computer Science, University of Birmingham; Department of Information Systems and Computer Science, Ateneo de Manila University
arXiv, December 2024
Benchmark Pretraining

📝 Paper Summary

Bias Evaluation Low-Resource NLP Multilingual Models
The authors introduce Filipino CrowS-Pairs and WinoQueer—7,074 culturally adapted prompt pairs—to expose significant sexist and homophobic biases in masked and causal multilingual language models.
Core Problem
Most bias benchmarks are English-centric, failing to account for linguistic differences (like gender neutrality in Filipino) and distinct cultural concepts of queerness in Southeast Asia.
Why it matters:
  • Multilingual models are increasingly deployed in Southeast Asia, but their potential social harms in local contexts remain unmeasured
  • English benchmarks rely on gendered pronouns (he/she) which do not exist in Filipino (siya), making direct translation ineffective for bias probing
  • Indigenous Filipino queer identities (e.g., bakla, tomboy) do not map 1:1 onto Western LGBTQ+ labels, rendering English bias datasets culturally irrelevant
Concrete Example: Directly translating 'He/She is a programmer' into Filipino fails because both 'he' and 'she' translate to the gender-neutral 'siya', yielding identical sentences that cannot measure bias. The authors instead use the descriptors 'lalaki' (man) and 'babae' (woman) to reintroduce the gender signal.
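The descriptor-substitution idea can be sketched in a few lines: a template with a person slot is filled with 'lalaki' and 'babae' to produce a minimal pair whose only difference is the gender descriptor. The template, slot marker, and helper name below are illustrative, not taken from the paper's released data.

```python
def make_minimal_pair(template: str, slot: str = "{PERSON}") -> tuple[str, str]:
    """Fill the slot with 'lalaki' (man) and 'babae' (woman).

    Filipino's third-person pronoun 'siya' is gender-neutral, so bias probes
    swap explicit descriptors instead of pronouns. The two sentences returned
    differ only in that descriptor, which is what a minimal pair requires.
    """
    return (template.replace(slot, "lalaki"), template.replace(slot, "babae"))

pair = make_minimal_pair("Ang {PERSON} ay isang programmer.")
# pair[0] = "Ang lalaki ay isang programmer."
# pair[1] = "Ang babae ay isang programmer."
```

Scoring each sentence of the pair under the model then reveals whether the gender descriptor alone shifts the model's preference.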
Key Novelty
Filipino CrowS-Pairs and Filipino WinoQueer
  • 7,074 culturally adapted prompt pairs derived from English CrowS-Pairs and WinoQueer, specifically addressing Filipino's gender-neutral grammar and local queer terminology
  • First application of bias benchmarks to causal multilingual models (e.g., SeaLLM, Merak-7B) developed specifically for the Southeast Asian context
  • Systematic documentation of cultural adaptation challenges (e.g., removing 'Thanksgiving', adapting 'social justice warrior' to 'fighting for too many causes')
Evaluation Highlights
  • Released 7,074 new Filipino bias evaluation challenge pairs (1,424 for CrowS-Pairs, 5,650 for WinoQueer)
  • Evaluated masked models (XLM-RoBERTa, mBERT) and causal models (XGLM, BLOOM, SeaLLM, Merak, Llama-3, Aya-23), confirming the presence of bias across all of them
  • Found that a multilingual model's bias magnitude correlates with the volume of pretraining data it saw in the target language
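CrowS-Pairs-style benchmarks aggregate pair-level comparisons into a single score: for each minimal pair, the model's (pseudo-)log-likelihood of the more stereotypical sentence is compared with the anti-stereotypical one, and the reported metric is the percentage of pairs where the stereotypical sentence wins. An unbiased model would land near 50%. A minimal sketch of that aggregation, with hypothetical toy scores (the function name and numbers are illustrative, not the paper's results):

```python
def bias_score(pll_stereo: list[float], pll_antistereo: list[float]) -> float:
    """Percent of pairs where the stereotypical sentence scores higher.

    Inputs are per-sentence (pseudo-)log-likelihoods, one entry per pair.
    50.0 indicates no systematic preference; values above 50.0 indicate
    bias toward the stereotypical phrasing.
    """
    assert len(pll_stereo) == len(pll_antistereo)
    wins = sum(s > a for s, a in zip(pll_stereo, pll_antistereo))
    return 100.0 * wins / len(pll_stereo)

# Toy log-likelihoods for four pairs (higher = more probable to the model):
print(bias_score([-10.2, -8.5, -9.1, -7.7], [-11.0, -8.9, -9.0, -8.1]))  # 75.0
```

The same aggregation applies to both model families: masked models use a pseudo-log-likelihood over masked tokens, while causal models can use the sentence log-probability directly.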
Breakthrough Assessment
7/10
Significant contribution to low-resource and Southeast Asian NLP fairness. While the methodology adapts existing English frameworks rather than inventing new metrics, the cultural rigor and dataset release fill a critical gap.