Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws

📝 Paper Summary

Fairness and Bias in LLMs, Reading Comprehension, Instruction Tuning

Most observed 'bias' in generative models on reading comprehension tasks stems from general comprehension failures in ambiguous contexts rather than inherent prejudice, and can be mitigated by teaching models to abstain from answering when information is missing.

Core Problem

Current bias evaluations conflate inherent stereotypical bias with generic comprehension flaws. When a model answers incorrectly in an under-informative context, it is often unclear whether it is relying on a stereotype or simply hallucinating because it lacks the reasoning capability to recognize that the answer is missing.

Why it matters:

Conflating bias with flaws leads to misguided mitigation efforts that focus on identity-specific debiasing rather than fixing root reasoning errors
Superficial evaluations fail to distinguish between inherent learned biases and spurious correlations resulting from flawed inferences
Lack of precise definitions and grounding in established resources makes it difficult to measure true progress in fairness

Concrete Example: Given a paragraph about French and Japanese etiquette that doesn't specify who is rude, if a model answers 'French' to 'Who is rude?', prior work labels this as bias. This paper argues it may just be a comprehension failure where the model fails to recognize the answer is 'not in background', as evidenced by similar failures on non-identity questions.

Key Novelty

Implicit Stereotype Mitigation via Ambiguity-Aware Instruction Tuning

Disentangles 'bias' (identity-specific stereotypes) from 'flaws' (general comprehension failures) by comparing model performance on ambiguous vs. disambiguous contexts across both fairness and general utility benchmarks
Proposes a mitigation framework that uses only general-purpose data (SQuAD, TriviaQA) to teach models to answer 'Not in background' for ambiguous questions, reducing stereotypical hallucinations without any explicit debiasing

Architecture

Contrast between 'Bias' and 'Flaws'. Left: Model hallucinates a stereotype ('French') when answer is not present. Right: Model hallucinates a random entity ('Norman') when answer is not present.

Evaluation Highlights

Reduces stereotypical outputs by over 60% across multiple dimensions (nationality, age, gender, etc.) by addressing comprehension failures
Identifies that models default to known stereotypes in only ~18.5% of their errors in ambiguous contexts; the remaining errors reflect spurious, non-stereotypical correlations
Demonstrates that models struggle with ambiguous contexts generally (e.g., Llama2-13B gets 5.69% EMO on ambiguous BBQ vs 49.22% on disambiguous)

Breakthrough Assessment

8/10

Offers a critical reframing of the 'bias' problem in LLMs, shifting focus from surface-level debiasing to fundamental reasoning capabilities. The finding that 'bias' is largely 'hallucination under ambiguity' is significant.

⚙️ Technical Details

Problem Definition

Setting: Reading comprehension (RC) where contexts can be ambiguous (answer not present) or disambiguous (answer present)

Inputs: Context paragraph C, Question Q

Outputs: Answer A (either a text span from C or an abstention token like 'Not in background')

Pipeline Flow

Input Processing (Context + Question + Instruction)
Generative Model (Produces Answer)
Evaluation (EMO or Bias Reinforcement check)

System Modules

Input Processing

Formats the raw RC dataset into instruction-based prompts

Model or implementation: Template-based formatter
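A template-based formatter of this kind could look roughly like the sketch below. The instruction wording, the `TEMPLATE` string, and the `format_example` helper are illustrative assumptions, not the paper's actual prompt; only the abstention string 'Not in background' comes from the summary above.

```python
from typing import Optional

ABSTAIN = "Not in background"  # abstention string named in the summary

# Hypothetical instruction template; the paper's exact wording is not given here.
TEMPLATE = (
    "Answer the question using only the background. "
    "If the answer is not stated, reply '{abstain}'.\n\n"
    "Background: {context}\n"
    "Question: {question}\n"
    "Answer:"
)


def format_example(context: str, question: str, answer: Optional[str] = None) -> dict:
    """Turn one reading-comprehension example into a prompt/target pair.

    A missing answer marks an ambiguous context, so the target is the
    abstention string.
    """
    prompt = TEMPLATE.format(
        abstain=ABSTAIN, context=context.strip(), question=question.strip()
    )
    target = answer.strip() if answer else ABSTAIN
    return {"prompt": prompt, "target": target}
```

Calling `format_example(context, "Who is rude?")` without an answer yields a pair whose target is the abstention string, which is the behavior the tuning data is meant to teach.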

Generative Model

Generates the answer or abstention token

Model or implementation: Various (Llama2, Mistral, Mixtral, Phi-2)

Novel Architectural Elements

Evaluation framework specifically designed to contrast performance on ambiguous vs. disambiguous inputs to isolate bias from flaws

Modeling

Base Model: Llama2-7B, Llama2-13B, Mistral-7B, Mixtral 8x7B, Phi-2

Training Method: Instruction Fine-Tuning (Full)

Objective Functions:

Purpose: Minimize prediction error on general purpose QA pairs.

Formally: Standard Cross-Entropy Loss over the target tokens.
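Written out, a standard token-level cross-entropy over the answer would take the form below. C, Q, and A follow the problem definition above; the instruction text I, the parameters θ, and the assumption that only answer tokens (not prompt tokens) contribute to the loss are notation introduced here for illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|A|} \log p_\theta\left(a_t \mid a_{<t},\, I,\, C,\, Q\right)
```

For an ambiguous context, A is the abstention string 'Not in background', so the same objective covers both answering and abstaining.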

Adaptation: Full fine-tuning (implied by context of standard instruction tuning)

Trainable Parameters: All parameters (implied)

Training Data:

SQuAD-v2 (ambiguous and disambiguous)
TriviaQA (disambiguous only, synthetically augmented to create ambiguous examples by removing answer spans)
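One plausible way to build ambiguous examples from answerable ones is sketched below. It removes every sentence that mentions the answer and relabels the example with the abstention string. Sentence-level removal, the regex sentence splitter, and the `make_ambiguous` helper are assumptions made for illustration; the released code should be consulted for the actual procedure.

```python
import re

ABSTAIN = "Not in background"  # abstention string named in the summary


def make_ambiguous(context: str, answer: str) -> dict:
    """Drop every sentence that mentions the answer, then relabel the example.

    Assumption: removal happens at sentence granularity so the remaining
    context stays grammatical.
    """
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    kept = [s for s in sentences if answer.lower() not in s.lower()]
    return {"context": " ".join(kept), "answer": ABSTAIN}
```

The question is left untouched, so the model sees a familiar question paired with a context that no longer supports any answer.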

Key Hyperparameters:

epochs: 5-10
batch_size: 1-2
optimizer: Adam
learning_rate_scheduler: Linear with warm-up ratio 0.01
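As a rough illustration of a linear schedule with a warm-up ratio of 0.01, the sketch below ramps the learning rate up over the first 1% of steps and then decays it linearly to zero. The function name, the step counts, and the `base_lr` argument are placeholders; the base learning rate is not listed in this summary.

```python
def linear_warmup_lr(step: int, total_steps: int, base_lr: float,
                     warmup_ratio: float = 0.01) -> float:
    """Learning rate at `step` under linear warm-up followed by linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With 1000 total steps, the first 10 steps are warm-up, the peak is reached at step 10, and the rate returns to zero at step 1000.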

Compute: NVIDIA RTX 8000 GPUs with 48 GB of memory each

Comparison to Prior Work

vs. StereoSet/CrowS-Pairs: Focuses on Reading Comprehension (BBQ) rather than sentence completion to test reasoning
vs. Self-debiasing: Mitigates bias implicitly by improving general reasoning (ambiguity detection) rather than explicitly targeting identity terms
vs. Augmented datasets: Uses generic datasets (SQuAD, TriviaQA) unrelated to stereotypes, showing that general reasoning improvements transfer to fairness tasks

Limitations

Evaluation primarily focuses on regional/nationality stereotypes, even though the paper claims reductions across other dimensions
Relies on the assumption that 'Not in background' is the optimal response for all ambiguous queries, which may not hold for all conversational nuances
Analysis restricted to the first sentence of generated responses

Reproducibility

Code: https://github.com/AkshitaJha/biased_or_flawed

Code and instruction-tuning dataset are publicly released. Training hyperparameters (epochs, batch size, optimizer) are provided. Exact synthetic data generation process for TriviaQA is described.

📊 Experiments & Results

Evaluation Setup

Zero-shot Reading Comprehension on ambiguous and disambiguous contexts

Benchmarks:

BBQ (Bias evaluation in Reading Comprehension)
SQuAD-v2 (general-purpose reading comprehension, used as the utility benchmark)

Metrics:

Exact Match Overlap (EMO)
Bias Reinforcement (percentage of errors aligning with stereotypes)
Statistical methodology: Not explicitly reported in the paper
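The Bias Reinforcement metric, as described here, is the share of incorrect answers that match the stereotyped answer. A minimal sketch follows; the record field names (`prediction`, `gold`, `stereotyped_answer`) and the normalization rule are assumptions for illustration, not the paper's implementation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def bias_reinforcement(records) -> float:
    """Percentage of incorrect predictions that equal the stereotyped answer.

    Each record is a dict with hypothetical keys 'prediction', 'gold',
    and 'stereotyped_answer'.
    """
    errors = [
        r for r in records
        if normalize(r["prediction"]) != normalize(r["gold"])
    ]
    if not errors:
        return 0.0
    stereotyped = sum(
        normalize(r["prediction"]) == normalize(r["stereotyped_answer"])
        for r in errors
    )
    return 100.0 * stereotyped / len(errors)
```

Correct abstentions are excluded from the denominator, so the score describes only how the model fails, not how often it fails.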

Key Results

Initial analysis reveals a massive performance gap between disambiguous and ambiguous contexts across all models, suggesting the 'bias' is largely a failure to handle ambiguity:

Benchmark	Metric	Baseline	This Paper	Δ
BBQ	EMO	82.21	13.62	-68.59
BBQ	EMO	80.17	6.88	-73.29
BBQ (Ambiguous)	Bias Reinforcement	100.00	18.50	-81.50

Generalizability checks on SQuAD-v2 confirm the flaw is task-specific (ambiguity), not identity-specific:

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD-v2	EMO	50.36	7.90	-42.46

Main Takeaways

Models are not necessarily inherently biased; they are fundamentally flawed at recognizing when they don't know an answer (ambiguity).
Addressing this general flaw via instruction tuning on generic data (SQuAD/TriviaQA) reduces stereotype reliance by >60% on fairness benchmarks (BBQ) without ever seeing fairness data.
The proposed method preserves utility on answerable questions while teaching the model to abstain on unanswerable ones.

📚 Prerequisite Knowledge

Prerequisites

Reading Comprehension (RC) task structure
Instruction Tuning / Fine-tuning
Bias benchmarks (BBQ, StereoSet)

Key Terms

EMO: Exact Match Overlap—a metric calculating the percentage of token overlap between predicted and ground truth answers, stricter than F1

Ambiguous context: A scenario where the provided text does not contain sufficient information to answer the question (correct answer is 'Not in background')

Disambiguous context: A scenario where the provided text contains sufficient information to answer the question accurately

Bias Reinforcement: A metric measuring the percentage of times a model's incorrect answer aligns with a known stereotype defined in the BBQ dataset

BBQ: Bias Benchmark for Question Answering—a dataset evaluating stereotypes against minority groups in reading comprehension

SQuAD-v2: Stanford Question Answering Dataset version 2—a reading comprehension benchmark including unanswerable questions

Instruction-tuning: Fine-tuning a pre-trained model on datasets formatted as instructions (e.g., 'Answer the question...') to improve task adherence