FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM-Based Recommender Systems

📝 Paper Summary

LLM-based Recommendation Systems | Fairness and Bias Mitigation | Conformal Prediction

FACTER uses conformal prediction to set adaptive fairness thresholds and iteratively refines prompts to mitigate bias in black-box LLM recommenders without retraining model parameters.

Core Problem

Generative LLMs in recommendation systems exhibit subtle biases where changing a sensitive attribute (e.g., gender) alters the recommendation, but determining a robust threshold for 'unfair' deviation is difficult without statistical guarantees.

Why it matters:

LLMs are often deployed as black boxes (APIs), preventing parameter-level bias mitigation like adversarial training.
Generative bias is subtler than classification bias; simple differences in output text style or sentiment can affect user perception.
Standard heuristic thresholds for bias detection lack statistical coverage guarantees, leading to either excessive false alarms or missed discrimination.

Concrete Example: If a prompt asks for a movie recommendation for a 'male teacher' versus a 'female teacher', the LLM might suggest action movies for the male teacher and romance for the female teacher. Existing methods struggle to decide whether the semantic distance between these two outputs is large enough to constitute a violation or is just random noise.

Key Novelty

Conformal Fairness Thresholding with Violation-Triggered Prompt Repair

Uses conformal prediction to calculate a dynamic semantic variance threshold from a calibration set, ensuring a statistical guarantee on the rate of fairness violations.
When a recommendation's semantic distance exceeds this threshold (a violation), the system automatically updates the prompt with an adversarial example to steer the LLM back to fairness.
Operates entirely on the input/output level (black-box), avoiding the need to access or modify the LLM's internal weights.
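
The thresholding step in the first bullet can be sketched with the standard split-conformal quantile rule. This is a minimal illustration, not the paper's code: the function names, the toy calibration scores, and the ceil((n + 1)(1 - alpha)) rank rule are assumptions, since this summary does not reproduce the paper's exact formula.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Assumption: standard split-conformal rule; the paper's exact
    definition of Q_alpha may differ.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("calibration set is empty")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: nothing can be flagged.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def is_violation(score, q_alpha):
    """A new output is flagged when its score exceeds the threshold."""
    return score > q_alpha


# Hypothetical non-conformity scores from a calibration set.
cal_scores = [0.12, 0.30, 0.05, 0.22, 0.41, 0.18, 0.27, 0.09, 0.35]
q_alpha = conformal_threshold(cal_scores, alpha=0.1)
```

With nine calibration scores and alpha = 0.1, the rank is 9, so the threshold is the largest calibration score. A smaller calibration set would push the rank past n and return infinity, which is why a reasonably large calibration set matters for a useful threshold.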

Evaluation Highlights

Reduces fairness violations by up to 95.5% on MovieLens and Amazon datasets compared to standard prompting.
Maintains strong recommendation accuracy while significantly lowering the Sub-Network Similarity Ratio (SNSR), indicating reduced cross-group semantic gaps.
Demonstrates that semantic variance is a potent proxy for bias, allowing detection without requiring expensive human annotation for every output.

Breakthrough Assessment

7/10

Solid application of conformal prediction to the specific problem of generative fairness thresholds. Highly relevant for black-box API users, though the prompt repair mechanism is a standard technique applied in a novel loop.

⚙️ Technical Details

Problem Definition

Setting: Fairness-aware item recommendation using a black-box LLM

Inputs: User context x (features), protected attribute a (e.g., gender), and reference item y

Outputs: Recommended item/text y_hat

Pipeline Flow

User Query (x, a) → Prompt Construction
LLM Inference → Output y_hat
Violation Detection (Conformal Check)
If Violation: Prompt Repair Loop → Re-Inference
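
The four pipeline steps above can be read as a small control loop. The sketch below is illustrative only: `run_with_repair`, the `max_repairs` cap, and the toy stand-ins for the LLM, the scorer, and the repair rule are all hypothetical names and behaviours, not the paper's implementation.

```python
def run_with_repair(query, llm, score_fn, repair_fn, base_prompt, q_alpha,
                    max_repairs=2):
    """Generate, run the conformal check, and re-prompt while a violation persists.

    max_repairs is an assumed safety cap on the number of repair rounds.
    """
    prompt = base_prompt
    output = llm(prompt, query)                    # LLM inference
    repairs = 0
    while score_fn(query, output) > q_alpha and repairs < max_repairs:
        prompt = repair_fn(prompt, query, output)  # prompt repair
        output = llm(prompt, query)                # re-inference
        repairs += 1
    passed = score_fn(query, output) <= q_alpha
    return output, prompt, repairs, passed


# Toy stand-ins (assumptions) so the loop can be exercised end to end.
def toy_llm(prompt, query):
    # Pretend the model stereotypes unless the prompt carries a repair note.
    if "Avoid" in prompt:
        return "drama"
    return "romance" if query["a"] == "female" else "action"


def toy_score(query, output):
    # Pretend non-conformity score: low only for the attribute-neutral answer.
    return 0.1 if output == "drama" else 0.8


def toy_repair(prompt, query, output):
    return prompt + f"\nAvoid attribute-driven answers such as '{output}' for a {query['x']}."


output, prompt, repairs, passed = run_with_repair(
    {"x": "teacher", "a": "female"}, toy_llm, toy_score, toy_repair,
    base_prompt="Recommend one movie genre.", q_alpha=0.41,
)
```

In this toy run the first answer is flagged, one repair round is applied, and the second answer passes the check. Capping the number of rounds bounds the extra latency that repair adds.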

System Modules

Prompt Constructor

Combines user context and current fairness instructions into a prompt

Model or implementation: Template-based injection

Black-Box LLM

Generates recommendation text

Model or implementation: Generic LLM (e.g., GPT-3.5 or Llama-2; the specific model varied across experiments)

Conformal Checker (Fairness Control)

Calculates non-conformity score S_new and compares against threshold Q_alpha

Model or implementation: Statistical comparator using Sentence-BERT embeddings
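
One way such a score could be assembled is sketched below. The exact formula is in the paper's equations and is not reproduced in this summary, so the form used here (an accuracy distance plus lambda times the largest cross-group distance among similar-context neighbours) is an assumption, as are the function names and the dictionary layout. Plain lists stand in for Sentence-BERT embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nonconformity(out_emb, ref_emb, ctx_emb, attr, neighbors,
                  lam=0.5, tau_rho=0.9):
    """Assumed score: accuracy term + lam * worst cross-group semantic gap.

    neighbors: list of dicts with keys 'ctx_emb', 'attr', 'out_emb'.
    Only neighbours with a different protected attribute and context
    similarity >= tau_rho contribute to the fairness term.
    """
    accuracy_term = cosine_distance(out_emb, ref_emb)
    gaps = [
        cosine_distance(out_emb, nb["out_emb"])
        for nb in neighbors
        if nb["attr"] != attr
        and 1.0 - cosine_distance(ctx_emb, nb["ctx_emb"]) >= tau_rho
    ]
    fairness_term = max(gaps, default=0.0)
    return accuracy_term + lam * fairness_term


# Toy 2-d "embeddings" (hypothetical values).
neighbors = [
    {"ctx_emb": [1.0, 0.0], "attr": "female", "out_emb": [0.0, 1.0]},  # counted
    {"ctx_emb": [1.0, 0.0], "attr": "male", "out_emb": [0.0, 1.0]},    # same group
    {"ctx_emb": [0.0, 1.0], "attr": "female", "out_emb": [0.0, 1.0]},  # context too far
]
s_new = nonconformity(out_emb=[1.0, 0.0], ref_emb=[1.0, 0.0],
                      ctx_emb=[1.0, 0.0], attr="male", neighbors=neighbors)
```

The resulting `s_new` would then be compared against Q_alpha; a value above the threshold counts as a violation.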

Prompt Repair (Fairness Control)

Updates the system prompt using the detected violation as an adversarial example

Model or implementation: Heuristic update rule
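
The update rule itself is not spelled out in this summary. As a rough sketch under that caveat, a system prompt could carry a bounded list of recent violations as "avoid" examples. The class name, the wording of the injected line, and the `max_examples` cap are all hypothetical.

```python
from collections import deque


class FairnessPrompt:
    """System prompt plus a bounded memory of flagged outputs (assumed design)."""

    def __init__(self, base_instruction, max_examples=3):
        self.base_instruction = base_instruction
        # Oldest examples are dropped first so the prompt stays short.
        self.examples = deque(maxlen=max_examples)

    def add_violation(self, user_context, flagged_output):
        """Record a detected violation as an example to avoid."""
        self.examples.append((user_context, flagged_output))

    def render(self):
        """Build the system prompt text sent to the black-box LLM."""
        lines = [self.base_instruction]
        for ctx, out in self.examples:
            lines.append(
                f"Avoid: for '{ctx}', the answer '{out}' varied with a protected attribute."
            )
        return "\n".join(lines)


fp = FairnessPrompt("Recommend items based only on stated preferences.",
                    max_examples=2)
fp.add_violation("teacher who likes thrillers", "romance")
fp.add_violation("nurse who likes sci-fi", "action")
fp.add_violation("engineer who likes comedy", "drama")
system_prompt = fp.render()
```

Bounding the memory is one possible way to keep the prompt within a context budget; whether the paper bounds or prunes its examples is not stated here.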

Novel Architectural Elements

Integration of conformal prediction directly into the inference loop to trigger prompt updates
Dynamic fairness threshold Q_alpha that adapts based on the non-conformity scores of incoming data

Modeling

Base Model: Evaluated on multiple LLMs. The introduction names Llama-2, Llama-3, and GPT-3.5 as representatives; the specific model used in the experiments is not stated explicitly and appears to be Llama-2-7B or a similar standard baseline.

Training Method: In-context learning / Prompt Engineering only

Adaptation: None (Model weights frozen)

Trainable Parameters: 0 (Prompt text is the only variable)

Training Data:

Calibration set D_cal taken from MovieLens/Amazon datasets
Users partitioned into calibration and test sets
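
A user-level split keeps every user's interactions on one side only, so calibration scores are not computed on users who also appear at test time. The sketch below is an assumption about how such a partition might be done; the split fraction, the seed, and the function name are not taken from the paper.

```python
import random


def split_users(user_ids, cal_fraction=0.5, seed=0):
    """Partition unique user IDs into disjoint calibration and test sets."""
    users = sorted(set(user_ids))          # deduplicate, fix an order
    random.Random(seed).shuffle(users)     # reproducible shuffle
    n_cal = int(len(users) * cal_fraction)
    return set(users[:n_cal]), set(users[n_cal:])


# Hypothetical interaction log: one user ID per interaction row.
interaction_user_ids = [1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7, 8]
cal_users, test_users = split_users(interaction_user_ids, cal_fraction=0.5)
```

Shuffling before the cut is what makes the two sets plausibly exchangeable; a split by signup date or activity level would risk the distribution shift mentioned under Limitations.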

Key Hyperparameters:

alpha: Significance level (e.g., 0.1 for 90% coverage)
lambda: Weighting parameter in non-conformity score (balancing accuracy vs. fairness)
tau_rho: Similarity threshold (e.g., 0.9) for defining local neighborhoods
tau_x: Radius parameter for user-context similarity

Compute: Not reported in the paper

Comparison to Prior Work

vs. UP5: FACTER does not require training a new model architecture; it wraps any existing black-box LLM.
vs. Post-hoc analysis: FACTER uses conformal prediction to set statistically valid thresholds rather than arbitrary heuristics.
vs. Adversarial Training: FACTER modifies the prompt (input) rather than the model parameters (weights), making it suitable for API-based models.

Limitations

Depends on the quality of the embedding model (Sentence-BERT) to measure semantic distance.
Requires a representative calibration set; if calibration data differs distributionally from test data (shift), guarantees may degrade.
Iterative prompting adds latency to inference when violations are detected.
Focuses on textual/semantic fairness, which may not perfectly align with item-utility fairness.

Reproducibility

No code URL is provided. The dataset sources (MovieLens, Amazon) are public. The algorithm is described mathematically (Eqs. 5-10), which allows reimplementation of the conformal scoring logic.

📊 Experiments & Results

Evaluation Setup

Recommendation fairness evaluation on standard datasets

Benchmarks:

MovieLens (Movie Recommendation)
Amazon (Product Recommendation)

Metrics:

Fairness Violation Rate (measured via conformal definition)
SNSV (Sub-Network Similarity Variance)
SNSR (Sub-Network Similarity Ratio)
CFR (Counterfactual Fairness Ratio)
Statistical methodology: Conformal prediction provides theoretical marginal coverage guarantees; empirical violation rates reported.
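
The violation-rate metric and a relative reduction can be computed as sketched below. The scores and the threshold are made-up numbers for illustration, not results from the paper, and the helper names are hypothetical.

```python
def violation_rate(scores, q_alpha):
    """Fraction of outputs whose non-conformity score exceeds the threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(1 for s in scores if s > q_alpha) / len(scores)


def relative_reduction(rate_before, rate_after):
    """Relative drop in violation rate, e.g. 0.75 means 75% fewer violations."""
    if rate_before == 0:
        raise ValueError("baseline rate is zero")
    return (rate_before - rate_after) / rate_before


# Hypothetical test-set scores before and after prompt repair.
q_alpha = 0.41
baseline_scores = [0.55, 0.30, 0.62, 0.48, 0.20, 0.71, 0.35, 0.44, 0.15, 0.50]
repaired_scores = [0.25, 0.30, 0.22, 0.38, 0.20, 0.45, 0.35, 0.31, 0.15, 0.29]

rate_before = violation_rate(baseline_scores, q_alpha)
rate_after = violation_rate(repaired_scores, q_alpha)
reduction = relative_reduction(rate_before, rate_after)
```

Comparing `rate_after` with the chosen alpha is a simple sanity check on whether the empirical rate sits near the level the threshold targets.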

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens/Amazon	Violation Rate	Not explicitly reported in the paper	Not explicitly reported in the paper	Reduced by up to 95.5%

The other metrics (SNSV, SNSR, CFR) describe group-level fairness improvements. FACTER aims to lower SNSR (the gap between groups) while maintaining SNSV (within-group consistency).

Main Takeaways

Conformal thresholding effectively controls the rate of fairness violations, aligning empirical results with theoretical coverage guarantees (1 - alpha).
The violation-triggered prompt engineering mechanism successfully reduces bias in subsequent outputs without needing to retrain the LLM.
Semantic variance in the embedding space serves as a reliable proxy for detecting generative bias, enabling automated monitoring.
FACTER achieves a favorable trade-off, significantly improving fairness metrics (like SNSR and CFR) with minimal impact on recommendation relevance/accuracy.

📚 Prerequisite Knowledge

Prerequisites

Conformal Prediction
Recommender Systems
Sentence Embeddings (Sentence-BERT)
Prompt Engineering

Key Terms

conformal prediction: A statistical framework that uses held-out calibration data to set a threshold (or prediction set) for future predictions, with a guarantee that the true value is covered with a specified probability

semantic variance: The variability in meaning (measured by embedding distance) of model outputs when inputs are semantically similar

SNSR: Sub-Network Similarity Ratio—a metric quantifying the semantic gap between demographic groups

SNSV: Sub-Network Similarity Variance—a metric capturing consistency within a demographic group

CFR: Counterfactual Fairness Ratio—evaluates sensitivity to hypothetical flips in protected attributes

calibration set: A held-out dataset used to calculate the non-conformity scores and set the threshold before online deployment

non-conformity score: A measure of how 'strange' or 'unusual' a data point is compared to the calibration set; here, it combines predictive error and semantic distance from other groups

exchangeability: A statistical assumption that the order of data points does not affect their joint distribution, necessary for conformal guarantees
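
For reference, the guarantee that these terms build toward can be written in the standard split-conformal form. The notation below is an assumption for illustration (the paper's own symbols and equation numbers are not reproduced in this summary).

```latex
% Assumed notation: S_1, ..., S_n are calibration non-conformity scores,
% S_{n+1} is the score of a new output, S_{(k)} is the k-th smallest S_i.
\[
  Q_\alpha = S_{(k)}, \qquad k = \lceil (n+1)(1-\alpha) \rceil \le n .
\]
% If S_1, ..., S_{n+1} are exchangeable, then
\[
  \Pr\bigl(S_{n+1} \le Q_\alpha\bigr) \ge 1-\alpha
  \quad\Longleftrightarrow\quad
  \Pr\bigl(S_{n+1} > Q_\alpha\bigr) \le \alpha .
\]
```

The bound is marginal: it holds on average over calibration and test draws, not for each user or group separately, and it relies on exchangeability holding between calibration and test data.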