SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

📝 Paper Summary

Hallucination suppression | Adversarial attacks for LLMs

SECA uses an LLM-based proposer and feasibility checker to generate adversarial prompts that are semantically equivalent to the original and linguistically coherent, yet still trick target models into hallucinating.

Core Problem

Existing adversarial methods for eliciting hallucinations rely on unrealistic prompts (gibberish tokens or semantic shifts), failing to evaluate how hallucinations occur in realistic, user-facing scenarios.

Why it matters:

Prior attacks use nonsensical tokens (e.g., 't)(?e va%&*') or alter the question's meaning, which doesn't reflect real-world user behavior
Realistic variations (e.g., paraphrasing) can trigger catastrophic failures in high-stakes domains like medicine or finance, but these vulnerabilities are currently underexplored
Current robustness evaluations overestimate model reliability because they do not test against semantically equivalent but adversarial rephrasings

Concrete Example: For the prompt 'what is the value of p in 24 = 2p?', a model answers correctly. However, SECA finds the semantically equivalent prompt 'If doubling the value of p results in 24, what is p?', which triggers the model to hallucinate 'p = 8, because 24/2 = 8'.

Key Novelty

Constraint-Preserving Zeroth-Order Optimization for Realistic Attacks

Formulates attack generation as a constrained optimization problem: maximize the probability of a specific hallucinated token subject to strict semantic equivalence and coherence constraints
Solves this using a 'proposer-checker' pipeline: a Proposer LLM generates diverse rephrasings, and a Feasibility Checker LLM strictly filters out any that change the meaning or are incoherent
Unlike gradient-based attacks (which often produce gibberish), SECA operates purely on discrete text via LLM interactions, ensuring the output looks like a natural human question
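The constrained formulation described above can be written compactly. The following is a sketch in notation chosen for this summary, not necessarily the paper's exact statement: x_0 is the original prompt, y^* is the target incorrect token, p_T is the Target LLM's next-token distribution, and SE and SC are hypothetical indicator functions for semantic equivalence and coherence.

```latex
\max_{x \in \mathcal{X}} \; \log p_T\left(y^{*} \mid x\right)
\quad \text{subject to} \quad
\mathrm{SE}(x, x_0) = 1, \qquad \mathrm{SC}(x) = 1
```

Because both constraints are judged by an LLM rather than given in closed form, no gradient of the feasible set is available, which is why a zeroth-order (query-only) search is a natural fit.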

Architecture

The iterative optimization loop of SECA. It illustrates how an initial prompt is processed by a Proposer LLM to generate candidates, which are then filtered by a Feasibility Checker LLM before being evaluated by the Target LLM.

Evaluation Highlights

Increases Attack Success Rate (ASR@30) from 48.2% to 80.3% on Llama-3-3B compared to raw prompts
Achieves 81.2% ASR@30 on Llama-3-8B (vs 63.5% for raw prompts) while maintaining near-zero semantic equivalence errors
Outperforms raw prompts on models that are otherwise hard to fool; e.g., on Qwen-2.5-7B, SECA achieves 36.9% ASR vs 10.2% for raw prompts. GCG (a gibberish attack) reaches a higher 60.6% ASR, but only with massive semantic and coherence violations

Breakthrough Assessment

8/10

Significant advance in realistic red-teaming. Moves beyond 'jailbreaking via gibberish' to finding natural language failures that are actually semantically valid, a critical step for deploying reliable LLMs.

⚙️ Technical Details

Problem Definition

Setting: Constrained optimization over discrete prompt space to maximize likelihood of a target incorrect token

Inputs: Original prompt x0, target incorrect token y*

Outputs: Adversarial prompt x that is semantically equivalent to x0 and coherent

Pipeline Flow

Initialization (Start with N copies of the original prompt)
Loop until max iterations or success:
    Proposer LLM (Generates M rephrased candidates)
    Feasibility Checker LLM (Filters candidates for semantic equivalence)
    Target LLM (Evaluates adversarial strength of the remaining candidates)
    Selection (Keep top-N valid candidates for next round)
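
The flow above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose`, `is_feasible`, and `score` are hypothetical callables standing in for the Proposer, Feasibility Checker, and Target LLMs; the success threshold and default sizes are placeholders; and the sketch filters candidates before scoring them, so that only feasible prompts are ever ranked.

```python
from typing import Callable, List


def seca_loop(
    x0: str,
    propose: Callable[[str, int], List[str]],  # stand-in for the Proposer LLM
    is_feasible: Callable[[str, str], bool],   # stand-in for the Feasibility Checker LLM
    score: Callable[[str], float],             # stand-in for the Target LLM objective
    n_keep: int = 3,
    m_candidates: int = 4,
    max_iters: int = 30,
    success_threshold: float = 0.5,            # placeholder stopping criterion
) -> str:
    """Return the best feasible prompt found (hypothetical interface)."""
    pool = [x0] * n_keep  # initialization: N copies of the original prompt
    for _ in range(max_iters):
        candidates: List[str] = []
        for x in pool:
            candidates.extend(propose(x, m_candidates))
        # Every candidate is checked against the ORIGINAL prompt, not its parent.
        feasible = [c for c in candidates if is_feasible(x0, c)]
        # Current pool members stay eligible, so the best score never decreases.
        # Ties are broken alphabetically to keep the sketch deterministic.
        ranked = sorted(set(feasible) | set(pool), key=lambda c: (-score(c), c))
        pool = ranked[:n_keep]
        if score(pool[0]) >= success_threshold:
            break
    return pool[0]
```

In practice each call to `score` or `is_feasible` is an LLM query, so a real implementation would likely cache results rather than re-score prompts as this sketch does.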

System Modules

Proposer LLM (P)

Generate diverse, semantically equivalent rephrasings of the input prompt

Model or implementation: GPT-4.1-Nano

Target LLM (T)

Compute the probability of the target incorrect token given a candidate prompt

Model or implementation: Various (Llama-3, GPT-4o-Mini, etc.)
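
As an illustration of the Target LLM's role, the probability of the target incorrect token can be obtained by normalizing next-token logits. This sketch assumes the logits are available as a plain dictionary over the tokens of interest (for example, the answer options); the function name and input format are hypothetical, and whether the paper normalizes over the options or the full vocabulary is not stated in this summary.

```python
import math
from typing import Dict


def target_token_probability(logits: Dict[str, float], target: str) -> float:
    """Softmax probability of `target` among the supplied next-token logits."""
    m = max(logits.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exp[target] / sum(exp.values())
```

For models served through an API, this step would instead read the returned log-probability of the target token, if the provider exposes it.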

Feasibility Checker LLM (F)

Verify if the candidate prompt is semantically equivalent to the original prompt

Model or implementation: GPT-4.1-Mini

Novel Architectural Elements

LLM-enforced constraint loop: Replacing mathematical projection steps (common in continuous optimization) with a 'Propose-then-Check' mechanism using distinct LLM roles

Modeling

Target models: Llama-3-3B/8B, Qwen-2.5-7B/14B, Llama-2-13B, GPT-4.1-Nano, GPT-4o-Mini

Comparison to Prior Work

vs. GCG: SECA generates fluent, semantically equivalent prompts (SEE≈0, SCE low) whereas GCG generates gibberish (high SCE) with high semantic error
vs. Investigator Agent/BEAST: SECA enforces strict semantic equivalence to the *original* question, whereas these methods allow the task meaning to shift (meaning-shift attacks), which doesn't isolate hallucination on the original fact
vs. PAIR [not cited in paper]: PAIR uses an attacker LLM to find jailbreaks iteratively; SECA is similar in structure but adds the explicit 'Feasibility Checker' step to enforce semantic equivalence, which PAIR does not constrain

Limitations

Computational cost is high due to multiple LLM calls (Proposer + Checker) per iteration
Relies on the capability of the Feasibility Checker LLM; if the checker fails, invalid prompts might leak through (though experiments show high agreement with humans)
Currently limited to targeted attacks (forcing a specific incorrect token) in multiple-choice settings
Attack success depends on the Proposer LLM's diversity; if it cannot find a phrasing, optimization stalls

Reproducibility

Code: https://github.com/Buyun-Liang/SECA

The paper specifies the models used for proposing (GPT-4.1-Nano), checking (GPT-4.1-Mini), and coherence evaluation (GPT-2). Prompt templates for the proposer and checker are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Open-ended Multiple-Choice Question Answering (MCQA) on filtered MMLU dataset (only questions where target model initially answers correctly)

Benchmarks:

MMLU (filtered) (Knowledge-intensive QA across 16 subjects)

Metrics:

ASR@K (Attack Success Rate at K)
SEE (Semantic Equivalence Error)
SCE (Semantic Coherence Error, measured via GPT-2 perplexity)
TTR (Type-Token Ratio)
Statistical methodology: Standard deviation calculated over 10,000 bootstrap samples with replacement
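
A minimal sketch of how ASR@K and its bootstrap standard deviation could be computed. The input format (the 1-based attempt index of each sample's first successful attack, or `None` if none succeeded) and the function names are assumptions made for illustration, as is the use of the population standard deviation over the bootstrap estimates.

```python
import random
import statistics
from typing import List, Optional


def asr_at_k(first_success: List[Optional[int]], k: int) -> float:
    """Percentage of samples whose first successful attack falls within k attempts."""
    hits = [s is not None and s <= k for s in first_success]
    return 100.0 * sum(hits) / len(hits)


def bootstrap_std(
    first_success: List[Optional[int]],
    k: int,
    n_boot: int = 10_000,
    seed: int = 0,
) -> float:
    """Standard deviation of ASR@K over bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(first_success)
    estimates = []
    for _ in range(n_boot):
        resample = [first_success[rng.randrange(n)] for _ in range(n)]
        estimates.append(asr_at_k(resample, k))
    return statistics.pstdev(estimates)
```

For example, `asr_at_k([1, None, 31, 30], 30)` counts the first and last samples as successes and the other two as failures.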

Key Results

SECA significantly improves attack success rates (ASR@30) compared to raw prompts on open-source Llama models, while maintaining semantic validity.

Model	Benchmark	Metric	Baseline	This Paper	Δ
Llama-3-3B	MMLU (filtered)	ASR@30 (%)	48.20	80.29	+32.09
Llama-3-8B	MMLU (filtered)	ASR@30 (%)	63.52	81.24	+17.72

SECA outperforms raw prompts on Qwen models, where raw prompts rarely induce hallucinations. GCG achieves higher ASR but with catastrophic semantic and coherence errors.

Model	Benchmark	Metric	Baseline	This Paper	Δ
Qwen-2.5-7B	MMLU (filtered)	ASR@30 (%)	10.19	36.86	+26.67
Qwen-2.5-7B	MMLU (filtered)	SEE	0.96	0.00	-0.96
Qwen-2.5-7B	MMLU (filtered)	SCE	1036.62	1.06	-1035.56

Experiment Figures

ASR@30 performance across 16 MMLU subjects for 7 different models, comparing Raw prompts vs SECA.

Analysis plots: (a) Objective value progression over iterations, (b) Distribution of hallucination types, (c) Lexical diversity (TTR) and length analysis.

Main Takeaways

Commercial models (GPT-4o-Mini) and strong open-weight models (Qwen-2.5) are robust to raw prompts (ASR of about 10% or lower) but vulnerable to SECA (ASR increases by roughly 20 percentage points or more)
SECA prompts are more verbose (longer) and lexically diverse (higher Type-Token Ratio) than original prompts, suggesting complexity triggers hallucinations
The method converges efficiently, typically within 30 iterations, steadily increasing the target token probability
Human evaluation confirms high agreement with the automated Feasibility Checker (Kappa 0.63-0.74) and Hallucination Evaluator (Kappa 0.75-0.87)

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and hallucination phenomena
Basic understanding of adversarial attacks (e.g., token-level optimization)
Concepts of semantic equivalence (entailment) and coherence (perplexity)

Key Terms

hallucination: When an LLM generates content that is factually incorrect or unfaithful to the input source, often confidently

semantic equivalence: A relationship where two prompts share the exact same meaning, intent, and ground-truth answer, despite lexical differences

semantic coherence: The quality of a text being grammatically correct, fluent, and meaningful to a human reader (measured here via perplexity)

zeroth-order optimization: Optimization methods that do not require gradient information (derivatives), relying instead on function evaluations (e.g., querying the LLM)

GCG: Greedy Coordinate Gradient—a gradient-based attack method that optimizes discrete tokens to force model behavior, often resulting in gibberish suffixes

ASR@K: Attack Success Rate at K—the percentage of samples for which at least one successful attack is found within K attempts

TTR: Type-Token Ratio—a measure of lexical diversity calculated as the ratio of unique tokens to total tokens
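
As a small illustration of the definition above, TTR can be computed in a few lines. This sketch assumes lowercased whitespace tokenization, which is a simplification; the tokenizer used in the paper is not specified in this summary.

```python
def type_token_ratio(text: str) -> float:
    """Ratio of unique tokens to total tokens (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # convention chosen here for empty input
    return len(set(tokens)) / len(tokens)
```

A prompt that repeats words scores lower: "the cat the dog" has three unique tokens out of four.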

perplexity: A metric measuring how surprised a model is by a sequence of text; lower perplexity indicates more natural/coherent text
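
For reference, perplexity is the exponential of the negative mean per-token log-probability. The sketch below assumes the natural-log probabilities of each token have already been obtained from a language model (GPT-2 in the paper's coherence metric); the function name and input format are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

If every token is assigned probability 0.5, the perplexity is 2: the model is as uncertain as a fair coin flip at each step.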