FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

📝 Paper Summary

Factuality Evaluation, Hallucination Detection

The authors introduce a 3D (Decompose, Decouple, Detach) paradigm to evaluate the factuality of subjective, interpretive AI claims in contact center conversations, releasing a benchmark dataset (FECT) grounded in this method.

Core Problem

Evaluating the factuality of AI-generated summaries of contact center conversations is difficult because the claims are often analytical interpretations (e.g., sentiment, root cause) rather than verbatim facts, so they lack a clear ground truth.

Why it matters:

Traditional hallucination detection relies on direct fact-checking against explicit evidence, which fails for subjective interpretations common in enterprise analysis
Ambiguity in analytical claims leads to low human inter-annotator agreement, which makes it hard to reliably evaluate or improve AI systems in high-stakes business contexts

Concrete Example: A claim stating 'The customer chose the plan for specific dentist coverage' might be true based on implied context (a mention of a 'particular dentist' followed by a plan change), but lacks explicit text evidence. Without a structured framework, one annotator might mark this factual while another marks it hallucinated due to the subjective inference required.

Key Novelty

3D Paradigm (Decompose, Decouple, Detach)

Decompose claims into minimal units, Decouple concrete terms from subjective interpretations, and Detach structure from meaning to verify relations
Applies linguistic principles (constituency, compositionality) to ground human and LLM judgments, significantly improving agreement on subjective claims

Architecture

The workflow for developing the FECT dataset and aligning LLM-judges.

Evaluation Highlights

Achieved 0.82 inter-annotator agreement (Cohen's kappa) among human experts using the 3D paradigm, up from 0.28 with unguided labeling
OpenAI's o1 model, aligned to the 3D paradigm, achieved an F1 score of 0.86 on the FECT benchmark without fine-tuning
Reasoning models (like o1) outperform non-reasoning models significantly on this task, with a +0.17 F1 gap between o1 and GPT-4o

Breakthrough Assessment

8/10

Addresses a neglected but critical area of hallucination detection (interpretive claims) with a rigorous linguistic framework. The large improvement in human agreement (Cohen's kappa from 0.28 to 0.82) supports the method's effectiveness.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of AI-generated claims as factual or non-factual based on a source conversation transcript

Inputs: A contact center conversation transcript and a single-sentence interpretive claim derived from it

Outputs: Binary label: Factual or Non-factual

Pipeline Flow

Input: Conversation + Claim
Prompting Strategy (3D Paradigm)
LLM Generation (Reasoning/Labeling)
Output: Factual/Non-factual
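The four stages above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: `call_llm`, `instructions`, `parse_label`, and `judge_claim` are hypothetical names, and the prompt layout and the rule of reading the label from the last non-empty line are assumptions.

```python
from typing import Callable


def parse_label(response: str) -> str:
    """Read the binary label from the last non-empty line of a judge response."""
    lines = [line.strip().lower() for line in response.splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty judge response")
    final = lines[-1]
    # Check the negative label first, because "non-factual" contains "factual".
    if "non-factual" in final or "nonfactual" in final or "not factual" in final:
        return "Non-factual"
    if "factual" in final:
        return "Factual"
    raise ValueError(f"No label found in judge response: {response!r}")


def judge_claim(
    transcript: str,
    claim: str,
    call_llm: Callable[[str], str],  # hypothetical: sends a prompt, returns model text
    instructions: str,               # hypothetical: the prompting strategy text
) -> str:
    """Input (conversation + claim) -> prompt -> LLM generation -> binary label."""
    prompt = (
        f"{instructions}\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Claim:\n{claim}\n\n"
        "Answer:"
    )
    return parse_label(call_llm(prompt))
```

Passing the model call in as a plain callable keeps the sketch independent of any particular provider SDK and makes it easy to test with a stub.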

System Modules

Human Annotation / Benchmark Creation

Establish ground truth using the 3D paradigm to filter ambiguous claims

Model or implementation: Human Experts

LLM Judge

Predict factuality of claims using specific prompting strategies

Model or implementation: Various (o1-2024-12-17, gpt-4o-2024-08-06, etc.)

Novel Architectural Elements

Integration of a linguistically-informed 3D (Decompose, Decouple, Detach) evaluation flow into the prompt structure for LLM-judges
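One way to picture a 3D evaluation flow embedded in a prompt is the sketch below. The step descriptions paraphrase this summary's one-line definition of the paradigm; they are not the authors' actual prompt text, and `THREE_D_STEPS` and `build_3d_instructions` are hypothetical names.

```python
# Paraphrased step descriptions (assumption: not the wording used in the paper's prompts).
THREE_D_STEPS = (
    ("Decompose", "Break the claim into minimal units."),
    ("Decouple", "Separate concrete terms from subjective interpretations."),
    ("Detach", "Set meaning aside and verify that the relations between units "
               "are supported by the conversation."),
)


def build_3d_instructions() -> str:
    """Assemble numbered judge instructions that follow the three steps in order."""
    lines = ["Evaluate the claim against the conversation in three steps:"]
    for number, (name, description) in enumerate(THREE_D_STEPS, start=1):
        lines.append(f"{number}. {name}: {description}")
    lines.append("Finish with a final line that is exactly 'Factual' or 'Non-factual'.")
    return "\n".join(lines)
```

Keeping the steps in a data structure rather than a hard-coded string makes it simple to compare a 3D prompt against a basic prompt that omits them.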

Modeling

Base Model: OpenAI o1 (o1-2024-12-17) [best-performing model]

Training Method: Prompt Engineering (Zero-shot with reasoning)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard LLM-as-a-Judge: Imposes a 3-step linguistic decomposition (3D) rather than asking for a holistic judgment
vs. HallOumi: Achieves higher performance via reasoning-aligned prompting rather than fine-tuning
vs. FactCheck/FactScore [not cited in paper]: Focuses on implicit/interpretive claims in dialog rather than atomic fact verification in Wikipedia-style text

Limitations

Dataset is synthetic (though modeled after real enterprise data) due to privacy constraints
Focuses specifically on contact center domain; generalization to other interpretive domains is untested
Relies on proprietary models (OpenAI o1) for best performance; open-weight models performed significantly worse

Reproducibility

Code: https://github.com/cresta/fect

📊 Experiments & Results

Evaluation Setup

Binary classification of claim factuality against conversation transcripts

Benchmarks:

FECT (Factuality Classification in Dialog) [New]

Metrics:

F1 score
Precision
Recall
Statistical methodology: 95% confidence intervals reported based on 10 runs per model/prompt configuration
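A 95% confidence interval over 10 runs could be computed as sketched below. The summary does not say which interval the authors used, so the Student's t interval on the mean F1 is an assumption, and `mean_ci_95` is a hypothetical helper name.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def mean_ci_95(scores):
    """Return (mean, lower, upper) for exactly 10 per-run scores."""
    if len(scores) != 10:
        raise ValueError("This sketch hard-codes the t value for exactly 10 runs")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_95_DF9 * sem
    return mean, mean - half_width, mean + half_width
```

With only 10 runs, the t critical value (2.262) gives a noticeably wider interval than the normal approximation (1.96), which is why the sketch uses it.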

Key Results

Performance comparison of LLM-judges on the FECT dataset, highlighting the superiority of reasoning models and the 3D prompting paradigm.

Benchmark	Metric	Baseline	This Paper	Δ
FECT	F1	0.69	0.86	+0.17
FECT	F1	0.58	0.86	+0.28
FECT	F1	0.83	0.86	+0.03

Experiment Figures

F1 scores of 17 different LLMs across 4 prompt variations (3D vs Basic, with vs without TTC).

Main Takeaways

Alignment with the 3D paradigm (Decompose, Decouple, Detach) allows LLMs to mirror human expert judgment on subjective claims.
Reasoning models (o1) significantly outperform standard models (GPT-4o) and specialized small models (HallOumi) on interpretive factuality tasks.
The gap between BASIC and 3D prompts is smaller for strong reasoning models, suggesting they may intrinsically perform similar decomposition, but 3D still yields the best absolute score.
Non-reasoning models struggle with this task regardless of prompting strategy, often falling below 0.70 F1.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucinations and factuality evaluation
Basics of linguistic semantics (constituency, compositionality)
Familiarity with LLM-as-a-Judge concepts

Key Terms

LLM-as-a-Judge: Using a Large Language Model to evaluate the outputs of another model (e.g., for correctness or safety)

3D Paradigm: Decompose, Decouple, Detach. The authors' proposed framework for evaluating analytical claims by breaking them down into linguistic components

Interpretive claims: Summaries or insights that infer meaning (sentiment, intent, root cause) rather than just restating explicit facts from the text

Cohen's kappa: A statistic that measures inter-annotator agreement for categorical items, accounting for the possibility of agreement occurring by chance
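For two annotators, the statistic is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. A minimal sketch follows; the function name and the toy labels in the usage note are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, if two annotators agree on 3 of 4 items and chance agreement works out to 0.5, kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.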

F1 score: A metric balancing precision and recall, used here to measure how well LLM-judges align with human ground-truth labels
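Precision, recall, and F1 for a judge's labels against human labels can be computed as in the sketch below. This summary does not state which label is treated as the positive class, so `positive` is left as a required argument; the function name is hypothetical.

```python
def precision_recall_f1(gold, pred, positive):
    """Precision, recall, and F1 of `pred` against `gold` for the `positive` label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.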

TTC: Test-Time Compute, i.e., letting a model generate intermediate reasoning tokens (like Chain-of-Thought) before it produces a final answer, with the aim of improving performance