← Back to Paper List

Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment

Melissa Kazemi Rad, H. Nghiem, Andy Luo, Sahil Wadhwa, M. Sorower, Stephen Rawls
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

LLM Safety Input Guardrails Adversarial Robustness
Fine-tuning LLMs with Chain-of-Thought explanations using SFT and alignment techniques (DPO/KTO) significantly improves their ability to detect malicious inputs and jailbreaks with minimal training data.
Core Problem
Standard LLM-as-a-Judge guardrails often fail to detect sophisticated adversarial attacks (jailbreaks) or provide correctly formatted responses, and training them typically requires massive resources.
Why it matters:
  • Malicious actors use jailbreaks and adversarial prompts to bypass safety filters in commercial AI products, risking reputational damage and safety violations
  • RLHF alignment is resource-intensive and requires large datasets, making it difficult to quickly adapt guardrails to new attack vectors
  • LLMs without specific tuning suffer from the 'lost in the middle' phenomenon and often fail to adhere to strict output formats required by downstream systems
Concrete Example: A user might prompt a guardrail with a 'Do Anything Now' (DAN) jailbreak. An untuned LLM might be tricked into accepting it or producing a verbose, unparsable explanation. The proposed aligned model correctly flags it as malicious with a concise, structured JSON verdict.
Key Novelty
CoT-Aligned Input Guardrails via DPO/KTO
  • Train the guardrail model to output a Chain-of-Thought explanation *before* its final verdict, forcing it to reason about why an input is safe or malicious
  • Apply preference alignment (DPO or KTO) specifically on these reasoning traces to encourage concise, accurate explanations and strict adherence to output formats
Architecture
Architecture Figure Figure 1
The high-level workflow of the Input Guardrail within a Conversational AI system.
Evaluation Highlights
  • Llama3-8B-Instruct tuned with DPO achieves 172% higher Attack Detection Ratio (ADR) compared to LlamaGuard-2 on the test set
  • Supervised Fine-Tuning (SFT) alone yields massive gains, improving ADR by up to 344% over baseline zero-shot models
  • The invalid response ratio (unparsable outputs) drops from ~16.8% in base models to nearly 0.3% after alignment
Breakthrough Assessment
7/10
Demonstrates that small-scale fine-tuning with CoT significantly boosts guardrail performance over general-purpose safety models like LlamaGuard-2, offering a practical recipe for deployment.
×