
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

X He, Q Zhang, P Chen, G Chen, L Yu, Y Yuan, SM Yiu
arXiv, 11/2025
Benchmark · Reasoning

📝 Paper Summary

Instruction Following · Constraint Satisfaction · Safety and Robustness · Evaluation
ConInstruct is a benchmark designed to evaluate how Large Language Models handle user instructions containing conflicting constraints, revealing that while proprietary models can detect conflicts, they rarely inform users about them.
Core Problem
Current instruction-following benchmarks assume coherent constraints, overlooking real-world scenarios where users unintentionally provide conflicting requirements that cannot be simultaneously satisfied.
Why it matters:
  • If models silently generate responses to conflicting instructions, users may accept incomplete or incorrect outputs without realizing the constraints were impossible to meet.
  • Existing benchmarks (IFEval, etc.) focus on alignment with valid instructions, leaving model behavior under constraint conflict systematically under-explored.
  • Blindly following one constraint often leads to violating another, degrading reliability and trustworthiness in complex prompt scenarios.
Concrete Example: A user asks for an email that must include the phrase "looking forward to" (Phrase Constraint) but also be strictly under 20 words (Length Constraint). A complete, valid email containing the phrase will likely exceed the length limit. GPT-4o typically generates a response that violates one constraint without any warning, whereas a safer model would ask for clarification.
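The two constraints in this example can be written as simple predicates. A minimal sketch (the checker functions and the sample email are illustrative, not from the paper's codebase) showing how a typical complete email satisfies the phrase constraint while breaking the length limit:

```python
def phrase_constraint(text: str) -> bool:
    # Must include the required phrase.
    return "looking forward to" in text.lower()

def length_constraint(text: str) -> bool:
    # Must be strictly under 20 words.
    return len(text.split()) < 20

email = ("Dear Dr. Smith, thank you very much for taking the time to meet "
         "with us yesterday; we are looking forward to working with your "
         "team on the upcoming project. Best regards, Alex")

print(phrase_constraint(email))  # True: phrase is present
print(length_constraint(email))  # False: 32 words, over the limit
```

A short 4-word phrase inside a 20-word budget is not strictly impossible, which is exactly why the conflict is easy to miss: satisfying both requires sacrificing the usual structure of a valid email.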
Key Novelty
Systematic Conflict Evaluation Benchmark (ConInstruct)
  • Introduces a dataset of instructions with deliberate, diverse conflicts (intra-constraint and inter-constraint) across six distinct NLP tasks.
  • Evaluates two distinct capabilities: Conflict Detection (can the model identify the contradiction?) and Conflict Resolution (does the model warn the user or just fail silently?).
  • Categorizes resolution behaviors into 'Conflict Unacknowledged', 'Clarification Requested', and 'Autonomously Resolved' to quantify model transparency.
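The three resolution labels can be illustrated with a toy response classifier. The keyword heuristic below is an assumption for illustration only; the paper's actual judging procedure may differ:

```python
def classify_resolution(response: str) -> str:
    # Toy keyword heuristic for the three resolution behaviors (illustrative only).
    r = response.lower()
    if "could you clarify" in r or "which constraint" in r:
        return "Clarification Requested"   # model asks the user to choose
    if "conflict" in r or "cannot satisfy both" in r:
        return "Autonomously Resolved"     # model flags the conflict, picks one side
    return "Conflict Unacknowledged"       # silent failure: no warning at all

print(classify_resolution("Here is your email: Dear team, ..."))
# -> Conflict Unacknowledged
```

The first two labels both count as "alerting the user"; only the third is the silent-failure mode the benchmark is designed to surface.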
Evaluation Highlights
  • Claude-4.5-Sonnet and DeepSeek-R1 achieve the highest conflict detection F1-scores at 87.3% and 91.5% respectively.
  • Models rarely warn users: GPT-4o generates responses without acknowledging conflicts in 97.5% of cases involving 1-2 conflicts.
  • Even the safest model, Claude-4.5-Sonnet, explicitly alerts users (requesting clarification or resolving autonomously) in only 45% of cases.
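The detection numbers above are F1-scores over binary conflict/no-conflict labels. A minimal computation with made-up counts (not the paper's data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # F1 = harmonic mean of precision and recall,
    # equivalently 2*TP / (2*TP + FP + FN).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 90 conflicts detected, 10 false alarms, 8 missed.
print(round(f1_score(tp=90, fp=10, fn=8), 3))  # -> 0.909
```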
Breakthrough Assessment
8/10
Identifies a critical blind spot in current LLM instruction following—silent failure under conflict. The benchmark methodology is sound and the findings (high detection / low reporting) are highly actionable for future safety alignment.