← Back to Paper List

Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws

A Jha, S Kabra, CK Reddy
arXiv, 12/2024 (2024)
QA Factuality

📝 Paper Summary

Fairness and Bias in LLMs Reading Comprehension Instruction Tuning
Most observed 'bias' in generative models on reading comprehension tasks stems from general comprehension failures in ambiguous contexts rather than inherent prejudice, and can be mitigated by teaching models to abstain from answering when information is missing.
Core Problem
Current bias evaluations conflate inherent stereotypical bias with generic comprehension flaws. When models answer incorrectly in under-informative contexts, it is often unclear if they are relying on a stereotype or simply hallucinating due to a lack of reasoning capability.
Why it matters:
  • Conflating bias with flaws leads to misguided mitigation efforts that focus on identity-specific debiasing rather than fixing root reasoning errors
  • Superficial evaluations fail to distinguish between inherent learned biases and spurious correlations resulting from flawed inferences
  • Lack of precise definitions and grounding in established resources makes it difficult to measure true progress in fairness
Concrete Example: Given a paragraph about French and Japanese etiquette that doesn't specify who is rude, if a model answers 'French' to 'Who is rude?', prior work labels this as bias. This paper argues it may just be a comprehension failure where the model fails to recognize the answer is 'not in background', as evidenced by similar failures on non-identity questions.
Key Novelty
Implicit Stereotype Mitigation via Ambiguity-Aware Instruction Tuning
  • Disentangles 'bias' (identity-specific stereotypes) from 'flaws' (general comprehension failures) by comparing model performance on ambiguous vs. disambiguous contexts across both fairness and general utility benchmarks
  • Proposes a mitigation framework that uses ONLY general-purpose data (SQuAD, TriviaQA) to teach models to answer 'Not in background' for ambiguous questions, effectively reducing stereotypical hallucinations without explicit debiasing
Architecture
Architecture Figure Figure 1 (implied conceptual flow)
Contrast between 'Bias' and 'Flaws'. Left: Model hallucinates a stereotype ('French') when answer is not present. Right: Model hallucinates a random entity ('Norman') when answer is not present.
Evaluation Highlights
  • Reduces stereotypical outputs by over 60% across multiple dimensions (nationality, age, gender, etc.) by addressing comprehension failures
  • Identifies that models default to known stereotypes only ~18.5% of the time in ambiguous contexts, while the rest are random flawed correlations
  • Demonstrates that models struggle with ambiguous contexts generally (e.g., Llama2-13B gets 5.69% EMO on ambiguous BBQ vs 49.22% on disambiguous)
Breakthrough Assessment
8/10
Offers a critical reframing of the 'bias' problem in LLMs, shifting focus from surface-level debiasing to fundamental reasoning capabilities. The finding that 'bias' is largely 'hallucination under ambiguity' is significant.
×