Argumentative Large Language Models for Explainable and Contestable Claim Verification

📝 Paper Summary

Explainable AI (XAI), Neuro-symbolic AI, Computational Argumentation

ArgLLMs augment Large Language Models with formal argumentation frameworks to produce decisions that are deterministically computed, faithfully explainable, and formally contestable.

Core Problem

LLMs often act as black boxes, providing outputs without faithful explanations of their reasoning or mechanisms for users to reliably contest and correct mistakes.

Why it matters:

LLMs suffer from hallucinations and logical inconsistencies, making unverified outputs risky for high-stakes decision-making
Current contestation methods (like re-prompting) are stochastic and provide no guarantee that user feedback will correct the reasoning
Existing 'Chain-of-Thought' explanations are not necessarily faithful to the model's actual computation or the final output

Concrete Example: A user might challenge an LLM's claim verification. With standard LLMs, prompting 'you are wrong because X' might randomly change the output or result in hallucinations. With ArgLLMs, a user can specifically attack a supporting argument (e.g., 'Evidence Y is unreliable'), and the system deterministically recalculates the claim's strength based on formal semantics.
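The contestation scenario above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Argument` class, the argument texts, and every base score are invented for the example. The strength update follows the standard DF-QuAD definitions, and the 0.5 decision threshold mentioned in the comments is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    """Hypothetical QBAF node: text, base score, and child arguments."""
    text: str
    base: float  # intrinsic strength (tau), expected in [0, 1]
    attackers: List["Argument"] = field(default_factory=list)
    supporters: List["Argument"] = field(default_factory=list)


def aggregate(strengths):
    """DF-QuAD aggregation: 1 - prod(1 - s); returns 0.0 for an empty list."""
    remaining = 1.0
    for s in strengths:
        remaining *= 1.0 - s
    return 1.0 - remaining


def strength(arg):
    """Dialectical strength of `arg` under DF-QuAD, computed recursively."""
    va = aggregate([strength(a) for a in arg.attackers])
    vs = aggregate([strength(s) for s in arg.supporters])
    if va >= vs:
        return arg.base - arg.base * (va - vs)
    return arg.base + (1.0 - arg.base) * (vs - va)


# Illustrative graph: one supporter and one attacker of the claim.
evidence = Argument("Evidence Y backs the claim", base=0.8)
counter = Argument("A known exception exists", base=0.4)
claim = Argument("The claim under review", base=0.5,
                 attackers=[counter], supporters=[evidence])

before = strength(claim)  # about 0.70, above an assumed 0.5 threshold

# Contest: the user attacks the supporting evidence directly.
evidence.attackers.append(Argument("Evidence Y is unreliable", base=0.9))
after = strength(claim)   # about 0.34, below the assumed threshold

print(f"before={before:.2f} after={after:.2f}")
```

Because the update is a pure function of the graph, rerunning `strength(claim)` on the same graph always yields the same value, which is the contrast with stochastic re-prompting drawn above.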

Key Novelty

Argumentative LLMs (ArgLLMs)

Uses the LLM to generate discrete pro/con arguments and their relationships (attacks/supports) rather than just a final answer
Constructs a Quantitative Bipolar Argumentation Framework (QBAF) from these outputs, which is a symbolic graph of reasoning steps
Computes the final decision deterministically using a gradual semantics algorithm (DF-QuAD) over the graph, ensuring the output mathematically follows from the generated arguments

Architecture

The complete ArgLLM pipeline from input to decision.

Evaluation Highlights

Achieves comparable accuracy to Chain-of-Thought (CoT) prompting across three claim verification datasets (TruthfulQA, StrategyQA, MedQA), with differences often <1%
Provides formal guarantees of contestability, proving that changing argument strengths or relations in the graph necessarily impacts the final evaluation
Demonstrates high faithfulness because the final classification is a direct mathematical result of the visible argument graph, unlike black-box generation

Breakthrough Assessment

7/10

Provides a strong neuro-symbolic bridge for explainability and contestability without a significant loss in accuracy compared to standard prompting. A solid step toward reliable AI decision-making.

⚙️ Technical Details

Problem Definition

Setting: Binary claim verification where a claim c (potentially given context i) is mapped to a truth value v(c) in {0, 1}

Inputs: A natural language claim c and optional context i

Outputs: A binary decision (True/False) derived from the computed strength of the claim within a generated argumentation framework

Pipeline Flow

Argument Generation (LLM generates pro/con arguments)
Relation Identification (LLM identifies attacks/supports)
Scoring (LLM assigns intrinsic strengths)
Graph Construction (QBAF assembly)
Evaluation (DF-QuAD algorithm computes final strengths)
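The five stages above can be sketched end to end as follows. This is a rough sketch under stated assumptions, not the released code: the `LLM` callable, the prompt strings, the dictionary layout, and the 0.5 threshold are all hypothetical, and stages 1 and 2 are collapsed by asking for one argument per polarity. A stub model stands in for Llama-2-13b-chat so the example runs without any external dependency.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical LLM interface: prompt string in, text out.
LLM = Callable[[str], str]


def generate_arguments(llm: LLM, claim: str) -> List[Tuple[str, str]]:
    """Stages 1-2 (simplified): one pro and one con argument, tagged by relation."""
    pro = llm(f"Give one argument supporting: {claim}")
    con = llm(f"Give one argument attacking: {claim}")
    return [("support", pro), ("attack", con)]


def score(llm: LLM, text: str) -> float:
    """Stage 3: ask for an intrinsic strength and clamp it to [0, 1]."""
    return min(1.0, max(0.0, float(llm(f"Rate from 0 to 1: {text}"))))


def build_qbaf(llm: LLM, claim: str) -> Dict:
    """Stage 4: assemble a depth-1 QBAF as a plain dictionary of scores."""
    qbaf = {"claim": claim, "base": score(llm, claim), "attack": [], "support": []}
    for relation, text in generate_arguments(llm, claim):
        qbaf[relation].append(score(llm, text))
    return qbaf


def evaluate(qbaf: Dict) -> float:
    """Stage 5: DF-QuAD on a depth-1 QBAF (children keep their base scores)."""
    def agg(values):
        rest = 1.0
        for v in values:
            rest *= 1.0 - v
        return 1.0 - rest

    va, vs, base = agg(qbaf["attack"]), agg(qbaf["support"]), qbaf["base"]
    if va >= vs:
        return base - base * (va - vs)
    return base + (1.0 - base) * (vs - va)


def verify(llm: LLM, claim: str, threshold: float = 0.5) -> bool:
    """Predict True when the claim's final strength exceeds the assumed threshold."""
    return evaluate(build_qbaf(llm, claim)) > threshold


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for illustration only."""
    if prompt.startswith("Rate"):
        if "pro:" in prompt:
            return "0.9"
        if "con:" in prompt:
            return "0.3"
        return "0.5"
    return "pro: stub argument" if "supporting" in prompt else "con: stub argument"


print(verify(fake_llm, "Water boils at 100 C at sea level"))
```

Only the first three stages touch the model. Once the scores are in the dictionary, `evaluate` is purely symbolic, so a user edit to any score changes the result in a predictable way.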

System Modules

Argument Generator

Generates arguments supporting or attacking the input claim

Model or implementation: Llama-2-13b-chat (or GPT-4 for comparison)

Relation Identifier

Determines whether each generated argument attacks or supports the claim (and, in deeper trees, other arguments)

Model or implementation: Llama-2-13b-chat

Scorer

Assigns an intrinsic base score (tau) to each argument and the main claim

Model or implementation: Llama-2-13b-chat

Reasoning Engine

Calculates the final dialectical strength of the claim

Model or implementation: Symbolic Algorithm (DF-QuAD)
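For reference, the DF-QuAD computation performed by the reasoning engine can be written out as below. The symbols are notation chosen here rather than taken from the summary: \sigma for dialectical strength, \mathrm{Att}(x) and \mathrm{Supp}(x) for the attackers and supporters of x. \tau is the base score assigned by the Scorer. The worked example uses made-up scores.

```latex
% Aggregation of the strengths of a set of attackers or supporters
\mathcal{F}(v_1, \dots, v_n) = 1 - \prod_{i=1}^{n} (1 - v_i), \qquad \mathcal{F}() = 0

% Combination of base score v_0 with aggregated attack v_a and support v_s
c(v_0, v_a, v_s) =
\begin{cases}
  v_0 - v_0 \, (v_a - v_s)       & \text{if } v_a \ge v_s \\
  v_0 + (1 - v_0) \, (v_s - v_a) & \text{if } v_a < v_s
\end{cases}

% Dialectical strength of argument x
\sigma(x) = c\Bigl( \tau(x),\;
  \mathcal{F}\bigl(\sigma(a) : a \in \mathrm{Att}(x)\bigr),\;
  \mathcal{F}\bigl(\sigma(s) : s \in \mathrm{Supp}(x)\bigr) \Bigr)

% Worked example: \tau = 0.5, attackers \{0.4\}, supporters \{0.5, 0.6\}
v_a = 0.4, \quad v_s = 1 - (0.5)(0.4) = 0.8, \quad
\sigma = 0.5 + (1 - 0.5)(0.8 - 0.4) = 0.7
```

When attack and support balance exactly, the argument keeps its base score. Otherwise the score moves toward 0 or 1 in proportion to the gap between the two aggregates.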

Novel Architectural Elements

Replacement of the LLM's direct final-answer generation with the evaluation of a symbolic argumentation framework (QBAF)
Hybrid inference pipeline where LLM provides content/structure and symbolic engine performs the adjudication

Modeling

Base Model: Llama-2-13b-chat

Training Method: Prompt engineering (5-shot) only; no fine-tuning reported in this paper

Compute: Experiments run on 1x NVIDIA A100 80GB GPU

Comparison to Prior Work

vs. CoT: ArgLLMs produce a structured graph from which the final decision is mathematically derived, whereas CoT performs all of its reasoning inside opaque free-text generation.
vs. Tree-of-Thought: ArgLLMs focus on dialectical relations (attack/support) for claim verification rather than problem decomposition for planning.
vs. Standard prompting: ArgLLMs provide explainability and contestability guarantees that direct prompting lacks.

Limitations

The extraction of arguments and relations relies on the LLM, which may still hallucinate or fail to identify correct relations.
The approach currently focuses on depth-1 trees (arguments attacking/supporting the claim directly) for the main experiments, though the formalism supports arbitrary depth.
Performance is comparable but not strictly superior to CoT in terms of accuracy; the main gain is interpretability.

Reproducibility

Code: https://github.com/CLArg-group/argumentative-llms

Code and datasets are publicly available at github.com/CLArg-group/argumentative-llms. The paper uses Llama-2-13b-chat and GPT-4, both widely accessible models.

📊 Experiments & Results

Evaluation Setup

Binary claim verification on three datasets

Benchmarks:

TruthfulQA (Fact verification: identifying truthful answers)
StrategyQA (Multi-hop reasoning verification)
MedQA (Medical claim verification)

Metrics:

Accuracy
Macro F1
Statistical methodology: Not explicitly reported in the paper
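The two metrics listed above can be computed for binary labels as in the following sketch. The function names and the toy label lists are illustrative assumptions and do not come from the paper's data or evaluation scripts.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy example with made-up labels (1 = claim true, 0 = claim false).
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(f"accuracy={accuracy(gold, pred):.3f} macro_f1={macro_f1(gold, pred):.3f}")
```

Macro F1 weights the true and false classes equally, so it is less flattering than accuracy when a dataset's labels are imbalanced.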

Key Results

Comparative analysis of ArgLLM variants against standard prompting and Chain-of-Thought (CoT) baselines across three datasets. ArgLLM (Chat) uses Llama-2-13b-chat.
Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	Accuracy	0.50	0.56	+0.06
StrategyQA	Accuracy	0.59	0.62	+0.03
MedQA	Accuracy	0.40	0.45	+0.05
TruthfulQA	Accuracy	0.56	0.56	0.00
StrategyQA	Accuracy	0.53	0.62	+0.09

Experiment Figures

A visual example of a Quantitative Bipolar Argumentation Framework (QBAF) for a claim about 'Ismaila Sarr'.

Main Takeaways

ArgLLMs generally match or slightly exceed the performance of Chain-of-Thought and Standard prompting across the tested datasets.
The primary contribution is not raw performance dominance, but the addition of faithful explainability and contestability without sacrificing accuracy.
The system is robust to the choice of underlying LLM: GPT-4-based ArgLLMs show similar trends relative to baselines, though with higher absolute scores.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting
Familiarity with computational argumentation frameworks (graphs with attack/support relations)
Knowledge of gradual semantics for evaluating argument strength

Key Terms

QBAF: Quantitative Bipolar Argumentation Framework, a directed graph in which nodes are arguments (each with an intrinsic strength) and edges represent attack or support relations

Gradual Semantics: An algorithm that calculates the final strength of an argument based on its initial strength and the strengths of its attackers and supporters

DF-QuAD: Discontinuity-Free Quantitative Argumentation Debate, a specific gradual semantics algorithm used to aggregate argument strengths deterministically

Chain-of-Thought: A prompting technique that encourages LLMs to generate intermediate reasoning steps before the final answer

Faithfulness: The property that the explanation provided by the system accurately reflects the process used to generate the output

Contestability: The ability for a user to intervene in the reasoning process (e.g., by modifying arguments) and reliably influence the outcome

Pro/Con Arguments: Arguments that either support (pro) or attack (con) a central claim or other arguments

Intrinsic Strength: The initial score assigned to an argument (e.g., by the LLM) before considering the impact of other arguments

Dialectical Strength: The final score of an argument after processing the full graph of attacks and supports via gradual semantics