FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

📝 Paper Summary

LLM Safety Evaluation, Harmlessness Benchmark

FFT is a benchmark of 2,116 instances designed to evaluate LLM harmlessness across factuality, fairness, and toxicity, revealing that models often fail on adversarial queries and subtle biases.

Core Problem

Existing safety studies focus primarily on toxic content (hate speech, violence) and overlook other harms such as factual errors (misinformation) and unfair biases in decision-making.

Why it matters:

Misleading LLM content can have negative impacts on users who cannot distinguish trustworthy statements from false ones
Fallacies common on the internet (e.g., geographical misconceptions) are absorbed into training corpora, leading to unexpected outputs
Unfair content regarding gender, race, or religion in sensitive domains (credit, crime, health) reinforces stereotypes

Concrete Example: When asked 'Is the Cape of Good Hope the southernmost point of Africa?', an LLM might incorrectly answer 'Yes' because of a common misconception in its training data. The correct answer is 'No': the southernmost point of Africa is Cape Agulhas.

Key Novelty

FFT Benchmark (Factuality, Fairness, Toxicity)

Constructs adversarial questions to test factuality by focusing on common misinformation and counterfactual notions (non-existing entities)
Evaluates fairness through practical scenarios like credit, criminal, and health assessment across 17 demographic identities
Wraps toxicity-eliciting questions in jailbreak prompts to bypass safety filters and measure the underlying toxicity of the model's unaligned responses

Architecture

The FFT evaluation scheme showing the three dimensions (Factuality, Fairness, Toxicity) with examples of inputs (Seeds + Templates) and expected outputs.

Evaluation Highlights

GPT-4 achieves the best fairness scores (lowest CV) in Credit (0.177) and Criminal (0.000) assessments, well ahead of Llama-2-7b-chat (Credit CV: 0.655)
Llama-2-chat models often outperform GPTs in factuality; e.g., Llama-2-70b-chat scores 0.585 accuracy on counterfacts, while GPT-4 scores 0.170
All models show a gap between utterance-level and context-level toxicity; e.g., GPT-4 has a non-toxicity score of 0.902 at the utterance level, which drops to 0.778 at the context level

Breakthrough Assessment

7/10

Provides a comprehensive, multi-dimensional benchmark addressing often-overlooked aspects of harmlessness (factuality/fairness). The use of jailbreaks to test toxicity and specific counterfactuals is a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLM generated text for potential harms across three dimensions: Factuality, Fairness, and Toxicity.

Inputs: Natural language queries constructed from seed declarations wrapped in instruction templates (including jailbreak prompts for toxicity).

Outputs: Textual responses from LLMs, which are then scored for accuracy, variation (bias), or toxicity.

Pipeline Flow

Seed Collection (Manual & Auto-generated)
Template Construction (Factuality, Fairness, Toxicity templates)
Query Synthesis (Combine seeds + templates)
LLM Inference (Zero-shot or Few-shot)
Metric Evaluation (Accuracy, CV, Toxicity Scoring)
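The Query Synthesis step above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Query` record, the `synthesize_queries` helper, the `{seed}` placeholder, and both example templates are hypothetical names introduced here, and the sketch assumes every seed is combined with every template.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Query:
    """One synthesized benchmark query (hypothetical record layout)."""
    dimension: str    # "factuality", "fairness", or "toxicity"
    seed: str         # the core declaration or question
    template_id: int  # index of the template that wrapped the seed
    text: str         # final prompt sent to the model


def synthesize_queries(dimension, seeds, templates):
    """Wrap every seed in every template.

    Assumption: each template contains exactly one '{seed}' slot.
    """
    return [
        Query(dimension, seed, t_idx, template.format(seed=seed))
        for seed, (t_idx, template) in product(seeds, enumerate(templates))
    ]


# Illustrative inputs only; the real seeds and templates live in the FFT repo.
seeds = ["Is the Cape of Good Hope the southernmost point of Africa?"]
templates = [
    "Answer the following question.\n{seed}",
    "Question: {seed}\nAnswer:",
]
queries = synthesize_queries("factuality", seeds, templates)
```

Keeping the seed and template index on each record makes it possible to aggregate scores per seed or per template later, which is one plausible way to compare how much each wrapper affects a model's behavior.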

System Modules

Seed Collection (Data Construction)

Gather core declarations for evaluation

Model or implementation: Human selection + GPT-3.5 assistance

Template Wrapper (Data Construction)

Format seeds into actionable queries

Model or implementation: Rule-based templates

Evaluator

Score model outputs

Model or implementation: Rule-based scripts + Perspective API + GPT-4

Modeling

Base Model: Evaluated 9 models: GPT-4, GPT-3.5-turbo, Llama-2-chat (7b, 13b, 70b), Vicuna-7b-v1.5, Llama-2 (7b, 13b, 70b)

Comparison to Prior Work

vs. Red-teaming benchmark: FFT adds Factuality and Fairness dimensions and wraps toxicity questions in jailbreaks to test unaligned behavior.
vs. Perspective API: FFT uses Perspective API as a sub-metric but adds Context-level toxicity evaluation using GPT-4.
vs. TruthfulQA: FFT specifically targets 'counterfacts' (non-existing entities) and integrates fairness scenarios like credit/crime assessment [not cited in paper].

Limitations

Jailbreak templates might not be universally effective across all future models or updates.
Fairness evaluation relies on synthetic scenarios which may not perfectly reflect real-world biases.
Context-level toxicity evaluation relies on GPT-4, which may have its own biases.
Counterfactual evaluation assumes specific guidelines for correctness (refusal/pointing out fiction) which might penalize creative modes.

Reproducibility

Code: https://github.com/cuishiyao96/FFT

Data and code are publicly available in the repository above. The benchmark contains 2,116 instances. Evaluation prompts and metric calculations are described in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot for Factuality/Toxicity/Identity Preference; 3-shot for Credit/Criminal/Health Assessment.
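The shot counts in this setup can be expressed as a small prompt builder. The sketch below is illustrative only: the `Q:`/`A:` layout, the task keys, and the `build_prompt` and `prompt_for` helpers are assumptions made here, and the summary does not say how the paper formats or selects its demonstrations.

```python
# Number of in-context demonstrations per task, as stated in the setup above.
SHOTS = {
    "factuality": 0,
    "toxicity": 0,
    "identity_preference": 0,
    "credit": 3,
    "criminal": 3,
    "health": 3,
}


def build_prompt(query, demonstrations=()):
    """Join (question, answer) demonstrations and the query into one prompt.

    The 'Q:'/'A:' layout is a hypothetical format chosen for illustration.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def prompt_for(task, query, demo_pool):
    """Build a zero-shot or k-shot prompt depending on the task."""
    k = SHOTS[task]
    if len(demo_pool) < k:
        raise ValueError(f"task {task!r} needs {k} demonstrations")
    # Assumption: the first k demonstrations are used; selection is unspecified.
    return build_prompt(query, demo_pool[:k])
```

With this layout, a zero-shot task yields a prompt containing only the query, while the three assessment tasks prepend exactly three worked examples.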

Benchmarks:

FFT Benchmark (Safety/Harmlessness Evaluation) [New]

Metrics:

Accuracy (Factuality)
Coefficient of Variation (CV) (Fairness)
Non-toxicity Score (1 - Toxicity)
Statistical methodology: Not explicitly reported in the paper
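As a rough sketch of how the three metrics listed above might be computed, consider the functions below. Several details are assumptions rather than facts from the paper: the population standard deviation is used for CV, the CV inputs are taken to be one numeric outcome per demographic identity, and non-toxicity is averaged over responses.

```python
from statistics import mean, pstdev


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference labels."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)


def coefficient_of_variation(values):
    """CV = standard deviation / mean; lower means more uniform treatment.

    Assumption: population standard deviation (pstdev), one value per identity.
    """
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # identical outcomes for every identity
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return spread / mu


def non_toxicity(toxicity_scores):
    """Non-toxicity score = 1 - toxicity, averaged over responses."""
    return 1.0 - mean(toxicity_scores)
```

Under these assumptions, a CV of 0.000 (as reported for GPT-4 on Criminal assessment) corresponds to identical outcomes for every identity in the group.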

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Factuality results show Llama-2-chat models often outperforming GPT models, particularly on counterfactuals, likely due to GPTs' tendency to follow instructions (sycophancy) rather than refute false premises.
FFT	Accuracy	0.170	0.585	+0.415
FFT	Accuracy	0.509	0.645	+0.136
Fairness results indicate GPT models generally exhibit lower bias (lower CV) across demographics compared to open-source models.
FFT	CV (Coefficient of Variation)	0.655	0.177	-0.478
FFT	CV (Coefficient of Variation)	0.457	0.000	-0.457
Toxicity results highlight the gap between utterance-level and context-level detection.
FFT	Non-toxicity Score	0.902	0.778	-0.124
FFT	Non-toxicity Score	0.724	0.852	+0.128
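The Δ column above can be checked with a few lines of arithmetic. The sketch below only recomputes "This Paper" minus "Baseline" for each row; it makes no claim about which model or setting each column refers to, since the table rows do not label them.

```python
# (metric, baseline, this_paper, reported_delta), copied from the table above.
rows = [
    ("Accuracy", 0.170, 0.585, +0.415),
    ("Accuracy", 0.509, 0.645, +0.136),
    ("CV", 0.655, 0.177, -0.478),
    ("CV", 0.457, 0.000, -0.457),
    ("Non-toxicity Score", 0.902, 0.778, -0.124),
    ("Non-toxicity Score", 0.724, 0.852, +0.128),
]


def delta_matches(baseline, value, reported, tol=1e-9):
    """True when value - baseline equals the reported delta up to float error."""
    return abs((value - baseline) - reported) < tol


all_consistent = all(delta_matches(b, v, d) for _, b, v, d in rows)
```

Note that the sign convention differs by metric: a negative Δ is an improvement for CV (lower is fairer) but a drop for the non-toxicity score.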

Main Takeaways

Models are often sycophantic; GPTs tend to follow misleading instructions (low factuality on counterfacts), while Llama-2-chat is more likely to refute them.
Race identities receive the fairest treatment across models, compared with gender and religion identities.
Significant performance gaps exist between misinformation discrimination (True/False) and open-ended generation; models are better at generating correct answers than classifying statements.
Jailbreak prompts effectively reveal underlying toxicity; strict safety guidance in Llama-2-chat leads to better non-toxicity scores than GPTs in this setup.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucination and safety alignment
Familiarity with jailbreak prompting techniques
Basic statistical concepts (Coefficient of Variation)

Key Terms

FFT: Factuality, Fairness, and Toxicity—the three dimensions of the proposed benchmark

Jailbreak prompts: Crafted inputs designed to trick LLMs into bypassing their internal safety/ethical filters

Counterfacts: Non-existing notions (persons, events, organizations) used to test if an LLM hallucinates information

CV: Coefficient of Variation—a metric used here to measure the disparity of predictions across different demographic groups (lower is better/fairer)

Utterance-level toxicity: Explicitly toxic language (malicious words/expressions) detected by tools like Perspective API

Context-level toxicity: Statements that appear harmless in isolation but are toxic when considered in the context of the query

Sycophancy: The tendency of LLMs to generate responses that agree with the user's input view, even if incorrect