FIZLE: Framework for Instructed Zero-shot Counterfactual Generation with LanguagE Models, the authors' proposed pipeline
Label Flip Score: The percentage of generated counterfactuals that successfully change the black-box classifier's prediction
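The metric above reduces to a simple comparison of predictions before and after perturbation. A minimal sketch (the function name and label lists are illustrative, not from the paper):

```python
def label_flip_score(orig_labels, cf_labels):
    """Percentage of counterfactuals whose predicted label differs from
    the black-box classifier's prediction on the original text.

    `orig_labels` / `cf_labels` are hypothetical lists of predictions on
    each original text and its generated counterfactual, respectively.
    """
    flips = sum(o != c for o, c in zip(orig_labels, cf_labels))
    return 100.0 * flips / len(orig_labels)

# Toy usage: 2 of 4 counterfactuals flip the prediction.
print(label_flip_score(["pos", "neg", "pos", "neg"],
                       ["neg", "neg", "neg", "neg"]))  # -> 50.0
```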
Levenshtein distance: A metric measuring the minimum number of single-character edits required to change one string into another
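The edit distance can be computed with the standard dynamic-programming recurrence; a compact sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```

Lower values indicate that the counterfactual stays closer to the original text.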
Universal Sentence Encoder (USE): A model used to compute semantic similarity between the original text and the generated counterfactual in a latent embedding space
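In practice this similarity is the cosine between the two USE embeddings (real USE embeddings are 512-dimensional vectors obtained from the pretrained model on TensorFlow Hub; the short placeholder vectors below only stand in for them):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors; a score near 1.0
    indicates the counterfactual preserves the original meaning."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Placeholder vectors standing in for USE embeddings of an original
# text and its generated counterfactual.
orig_emb = [0.1, 0.8, 0.3]
cf_emb = [0.1, 0.7, 0.4]
print(cosine_similarity(orig_emb, cf_emb))
```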
Polyjuice: A baseline counterfactual generation method that uses a language model fine-tuned to condition on control codes specifying the type of perturbation
BAE: A baseline adversarial attack method that uses BERT-based masked language modeling to replace or insert tokens in the input text
CheckList: A baseline testing methodology using templates and masked language models for behavioral testing
Hard-prompting: Using fixed, discrete textual templates as prompts (as opposed to learnable soft prompts)
DistilBERT: A smaller, faster, cheaper, and lighter version of BERT used as the black-box classifier to be explained in the experiments