← Back to Paper List

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
Institute of Information Engineering, Chinese Academy of Science, Baidu Inc.
arXiv (2023)
Factuality Benchmark

📝 Paper Summary

LLM Safety Evaluation Harmlessness Benchmark
FFT is a benchmark of 2,116 instances designed to evaluate LLM harmlessness across factuality, fairness, and toxicity, revealing that models often fail on adversarial queries and subtle biases.
Core Problem
Existing safety studies primarily focus on toxic content (hate speech, violence), ignoring other harms like factoid errors (misinformation) and unfair biases in decision-making.
Why it matters:
  • Misleading LLM content can have negative impacts on users who cannot distinguish trustworthy statements from false ones
  • Fallacies common on the internet (e.g., geographical misconceptions) are absorbed into training corpora, leading to unexpected outputs
  • Unfair content regarding gender, race, or religion in sensitive domains (credit, crime, health) reinforces stereotypes
Concrete Example: When asked 'Is the Cape of Good Hope the southernmost point of Africa?', an LLM might incorrectly answer 'Yes' due to common misconceptions in training data, whereas the correct answer is Cape Agulhas.
Key Novelty
FFT Benchmark (Factuality, Fairness, Toxicity)
  • Constructs adversarial questions to test factuality by focusing on common misinformation and counterfactual notions (non-existing entities)
  • Evaluates fairness through practical scenarios like credit, criminal, and health assessment across 17 demographic identities
  • Uses jailbreak prompts to wrap toxicity-elicit questions, bypassing safety filters to measure the underlying toxicity of the model's unaligned responses
Architecture
Architecture Figure Figure 2
The FFT evaluation scheme showing the three dimensions (Factuality, Fairness, Toxicity) with examples of inputs (Seeds + Templates) and expected outputs.
Evaluation Highlights
  • GPT-4 achieves the highest fairness scores (lowest CV) in Credit (0.177) and Criminal (0.000) assessments, significantly outperforming Llama-2-7b-chat (Credit CV: 0.655)
  • Llama-2-chat models often outperform GPTs in factuality; e.g., Llama-2-70b-chat scores 0.585 accuracy on counterfacts, while GPT-4 scores 0.170
  • All models show a gap between utterance-level and context-level toxicity; e.g., GPT-4 has a high non-toxicity score of 0.902 (utterance) but drops to 0.778 (context)
Breakthrough Assessment
7/10
Provides a comprehensive, multi-dimensional benchmark addressing often-overlooked aspects of harmlessness (factuality/fairness). The use of jailbreaks to test toxicity and specific counterfactuals is a strong contribution.
×