Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment

📝 Paper Summary

LLM Safety Input Guardrails Adversarial Robustness

Fine-tuning LLMs with Chain-of-Thought explanations using SFT and alignment techniques (DPO/KTO) significantly improves their ability to detect malicious inputs and jailbreaks with minimal training data.

Core Problem

Standard LLM-as-a-Judge guardrails often fail to detect sophisticated adversarial attacks (jailbreaks) or provide correctly formatted responses, and training them typically requires massive resources.

Why it matters:

Malicious actors use jailbreaks and adversarial prompts to bypass safety filters in commercial AI products, risking reputational damage and safety violations
RLHF alignment is resource-intensive and requires large datasets, making it difficult to quickly adapt guardrails to new attack vectors
LLMs without specific tuning suffer from the 'lost in the middle' phenomenon and often fail to adhere to strict output formats required by downstream systems

Concrete Example: A user might prompt a guardrail with a 'Do Anything Now' (DAN) jailbreak. An untuned LLM might be tricked into accepting it or producing a verbose, unparsable explanation. The proposed aligned model correctly flags it as malicious with a concise, structured JSON verdict.

Key Novelty

CoT-Aligned Input Guardrails via DPO/KTO

Train the guardrail model to output a Chain-of-Thought explanation *before* its final verdict, forcing it to reason about why an input is safe or malicious
Apply preference alignment (DPO or KTO) specifically on these reasoning traces to encourage concise, accurate explanations and strict adherence to output formats

Architecture

The high-level workflow of the Input Guardrail within a Conversational AI system.

Evaluation Highlights

Llama3-8B-Instruct tuned with DPO achieves 172% higher Attack Detection Ratio (ADR) compared to LlamaGuard-2 on the test set
Supervised Fine-Tuning (SFT) alone yields massive gains, improving ADR by up to 344% over baseline zero-shot models
The invalid response ratio (unparsable outputs) drops from ~16.8% in base models to nearly 0.3% after alignment

Breakthrough Assessment

7/10

Demonstrates that small-scale fine-tuning with CoT significantly boosts guardrail performance over general-purpose safety models like LlamaGuard-2, offering a practical recipe for deployment.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of user inputs as 'safe' or 'malicious/jailbreak', accompanied by a natural language explanation

Inputs: User query (potential adversarial prompt)

Outputs: Structured response containing a 'violation' boolean and an 'explanation' string

Pipeline Flow

Input Query → [Guardrail LLM] → Structured Verdict (Explanation + Label)

System Modules

Guardrail LLM

Classify input as safe/unsafe with reasoning

Model or implementation: Various (Llama-3-8B-Instruct, Mistral-7B-v2, etc.) with LoRA adapters

Novel Architectural Elements

Integration of CoT reasoning directly into the guardrail's alignment process (SFT+DPO/KTO) specifically for binary safety classification

Modeling

Base Model: Llama-3-8B-Instruct (best performer), also tested Mistral-7B-Instruct-v2, Mixtral-8x7B-Instruct-v1, Llama-2-13B-Chat

Training Method: SFT followed by Alignment (DPO or KTO)

Objective Functions:

Purpose: Maximize likelihood of correct classification and reasoning.

Formally: Standard Cross-Entropy Loss (for SFT).
Purpose: Align model to prefer concise, accurate CoT reasoning.

Formally: DPO Loss (optimizing policy against reference based on preferences).
Purpose: Align model using binary good/bad signals based on human utility functions.

Formally: KTO Loss (maximizing utility of generations based on prospect theory).

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: Not reported in the paper

Training Data:

400 fine-tuning examples (200 positive/malicious, 200 negative/safe)
Positive examples from AdvBench, MaliciousInstruct, Forbidden Question Set, Jailbreak Prompt Set
Synthetically generated CoT explanations (accepted/rejected) manually annotated for alignment

Key Hyperparameters:

inference_top_p: 1
inference_temperature: 0

Compute: Not reported in the paper

Comparison to Prior Work

vs. LlamaGuard-2: LlamaGuard covers broader categories but is less specialized for specific jailbreaks; this paper's method aligns specifically for reasoning-based detection on a targeted dataset.
vs. DeBERTaV3/PromptGuard: These are classifiers without generation capabilities; the proposed method uses LLM-as-a-Judge to provide interpretable reasoning (CoT) alongside the verdict.

Limitations

Small training and evaluation datasets (400 train, ~6800 test) may not cover all real-world attack vectors
Reliance on synthetic data generation for negative samples and explanations
Evaluation limited to specific types of malicious/jailbreak queries
Potential overfitting to the specific prompt structure used during training

Reproducibility

Datasets are constructed from open-source benchmarks (AdvBench, etc.) combined with synthetic generation. Code and trained weights are not explicitly provided. Methodology is described in detail.

📊 Experiments & Results

Evaluation Setup

Binary classification of inputs as malicious/jailbreak vs. safe

Benchmarks:

Custom Test Set (Safety Classification) [New]

Metrics:

Attack Detection Ratio (ADR)
False Positive Rate (FPR)
F1 Score
Invalid Response Ratio
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Fine-tuning (SFT) and alignment (DPO/KTO) significantly improve detection capabilities across multiple LLMs.
Custom Test Set	Attack Detection Ratio (ADR)	22.2	98.7	+76.5
Custom Test Set	F1 Score	36.2	99.1	+62.9
Custom Test Set	Invalid Response Ratio	16.8	0.3	-16.5
The aligned model outperforms existing open-source safety guardrails on the custom test set.
Custom Test Set	Attack Detection Ratio (ADR)	36.2	98.7	+62.5
Custom Test Set (Jailbreaks only)	Attack Detection Ratio (ADR)	42.8	98.9	+56.1
Custom Test Set (Jailbreaks only)	False Positive Rate (FPR)	99.8	0.8	-99.0

Experiment Figures

Bar charts comparing F1, ADR, FPR, and Invalid Response Ratio across 4 LLMs and 3 tuning strategies (Zero-shot, SFT, DPO, KTO).

Impact of Chain-of-Thought (CoT) prompting on base model performance.

Main Takeaways

Supervised Fine-Tuning (SFT) is the primary driver of performance, yielding massive gains in Attack Detection Ratio (ADR) and F1.
Alignment methods (DPO and KTO) provide marginal additional gains over SFT in accuracy but significantly improve response formatting and conciseness.
Llama-3-8B-Instruct consistently outperformed other base models (Mistral, Mixtral, Llama-2) across all tuning strategies.
Base models struggle significantly with 'standalone jailbreaks' (without malicious payloads), but fine-tuning effectively closes this gap.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM fine-tuning methods (SFT, LoRA)
Familiarity with alignment techniques (RLHF, DPO, KTO)
Knowledge of adversarial attacks on LLMs (Jailbreaks, Prompt Injection)

Key Terms

LLM-as-a-Judge: Using an LLM to evaluate the outputs or inputs of another system, in this case acting as a safety classifier

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before producing a final answer to improve accuracy

SFT: Supervised Fine-Tuning—training a model on a dataset of input-output pairs to teach it a specific task or format

DPO: Direct Preference Optimization—an alignment method that optimizes a model to prefer one response over another without needing a separate reward model

KTO: Kahneman-Tversky Optimization—an alignment method using a loss function based on prospect theory, requiring only binary good/bad labels rather than paired preferences

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices

Jailbreak: Adversarial prompts designed to trick an LLM into bypassing its safety constraints (e.g., 'Do Anything Now' prompts)

ADR: Attack Detection Ratio—the percentage of malicious inputs correctly identified as violations by the guardrail

FPR: False Positive Rate—the percentage of safe inputs incorrectly flagged as malicious violations