
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

X He, Q Zhang, P Chen, G Chen, L Yu, Y Yuan, SM Yiu
arXiv, 11/2025
Benchmark · Reasoning

📝 Paper Summary

Instruction Following · Constraint Satisfaction · Safety and Robustness · Evaluation
ConInstruct is a benchmark designed to evaluate how Large Language Models handle user instructions containing conflicting constraints, revealing that while proprietary models can detect conflicts, they rarely inform users about them.
Core Problem
Current instruction-following benchmarks assume coherent constraints, overlooking real-world scenarios where users unintentionally provide conflicting requirements that cannot be simultaneously satisfied.
Why it matters:
  • If models silently generate responses to conflicting instructions, users may accept incomplete or incorrect outputs without realizing the constraints were impossible to meet.
  • Existing benchmarks (IFEval, etc.) focus on alignment with valid instructions, leaving model behavior under constraint conflict systematically under-explored.
  • Blindly following one constraint often leads to violating another, degrading reliability and trustworthiness in complex prompt scenarios.
Concrete Example: A user asks for an email that must include the phrase "looking forward to" (Phrase Constraint) but also be strictly under 20 words (Length Constraint). A complete, valid email containing the phrase will likely exceed the length limit. GPT-4o typically generates a response that violates one constraint without any warning, whereas a safer model would ask for clarification.
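The two constraints in this example can be written as simple predicates. A minimal sketch (the checker functions and the sample email are illustrative, not from the paper's codebase) showing how a typical complete email satisfies the phrase constraint while breaking the length limit:

```python
def phrase_constraint(text: str) -> bool:
    # Must include the required phrase.
    return "looking forward to" in text.lower()

def length_constraint(text: str) -> bool:
    # Must be strictly under 20 words.
    return len(text.split()) < 20

email = ("Dear Dr. Smith, thank you very much for taking the time to meet "
         "with us yesterday; we are looking forward to working with your "
         "team on the upcoming project. Best regards, Alex")

print(phrase_constraint(email))  # True: phrase is present
print(length_constraint(email))  # False: 32 words, over the limit
```

A short 4-word phrase inside a 20-word budget is not strictly impossible, which is exactly why the conflict is easy to miss: satisfying both requires sacrificing the usual structure of a valid email.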
Key Novelty
Systematic Conflict Evaluation Benchmark (ConInstruct)
  • Introduces a dataset of instructions with deliberate, diverse conflicts (intra-constraint and inter-constraint) across six distinct NLP tasks.
  • Evaluates two distinct capabilities: Conflict Detection (can the model identify the contradiction?) and Conflict Resolution (does the model warn the user or just fail silently?).
  • Categorizes resolution behaviors into 'Conflict Unacknowledged', 'Clarification Requested', and 'Autonomously Resolved' to quantify model transparency.
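The three resolution labels can be illustrated with a toy response classifier. The keyword heuristic below is an assumption for illustration only; the paper's actual judging procedure may differ:

```python
def classify_resolution(response: str) -> str:
    # Toy keyword heuristic for the three resolution behaviors (illustrative only).
    r = response.lower()
    if "could you clarify" in r or "which constraint" in r:
        return "Clarification Requested"   # model asks the user to choose
    if "conflict" in r or "cannot satisfy both" in r:
        return "Autonomously Resolved"     # model flags the conflict, picks one side
    return "Conflict Unacknowledged"       # silent failure: no warning at all

print(classify_resolution("Here is your email: Dear team, ..."))
# -> Conflict Unacknowledged
```

The first two labels both count as "alerting the user"; only the third is the silent-failure mode the benchmark is designed to surface.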
Evaluation Highlights
  • Claude-4.5-Sonnet and DeepSeek-R1 achieve the highest conflict detection F1-scores at 87.3% and 91.5% respectively.
  • Models rarely warn users: GPT-4o generates responses without acknowledging conflicts in 97.5% of cases involving 1-2 conflicts.
  • Even the safest model, Claude-4.5-Sonnet, explicitly alerts users (requesting clarification or resolving autonomously) in only 45% of cases.
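The detection numbers above are F1-scores over binary conflict/no-conflict labels. A minimal computation with made-up counts (not the paper's data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # F1 = harmonic mean of precision and recall,
    # equivalently 2*TP / (2*TP + FP + FN).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 90 conflicts detected, 10 false alarms, 8 missed.
print(round(f1_score(tp=90, fp=10, fn=8), 3))  # -> 0.909
```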
Breakthrough Assessment
8/10
Identifies a critical blind spot in current LLM instruction following—silent failure under conflict. The benchmark methodology is sound and the findings (high detection / low reporting) are highly actionable for future safety alignment.