
ReMoDetect: Reward Models Recognize Aligned LLM's Generations

Hyunseok Lee, Jihoon Tack, Jinwoo Shin
Korea Advanced Institute of Science and Technology
Neural Information Processing Systems (2024)

📝 Paper Summary

Tags: LLM-generated text detection, AI Safety, Alignment
ReMoDetect detects machine-generated text by exploiting the observation that aligned LLMs' generations consistently receive higher reward-model scores than human-written text, and it sharpens this signal via preference fine-tuning of the reward model.
Core Problem
Detecting text from recent aligned LLMs (like GPT-4) is difficult because existing methods either overfit to the specific LLMs seen during training or fail to capture the subtle commonalities shared by highly aligned generations.
Why it matters:
  • The proliferation of LLMs increases risks of fake news, plagiarism, and malicious content generation, necessitating reliable detection tools
  • Existing supervised detectors often fail to generalize to unseen models, while zero-shot methods struggle with the high quality of state-of-the-art aligned models
Concrete Example: When an aligned model like GPT-4 generates text, it is optimized to maximize human preference and often receives a 'super-human' reward score. A standard classifier may miss this signal, but a reward model assigns the text a score (e.g., 0.9) well above the human baseline (e.g., 0.5), which ReMoDetect uses as the detection signal.
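The detection rule this example describes reduces to a single reward-model forward pass and a threshold. A minimal sketch, with a toy stand-in for the reward model (the function `reward_score` below is a hypothetical stub, not the paper's preference-tuned model):

```python
def reward_score(text: str) -> float:
    """Hypothetical stand-in for a learned reward model's scalar output.
    ReMoDetect would use a preference-tuned reward model here instead."""
    # Toy heuristic for illustration only: longer text scores higher.
    return min(1.0, len(text.split()) / 50)

def detect_llm_text(text: str, threshold: float = 0.7) -> bool:
    """Flag text whose predicted reward exceeds a human-baseline threshold."""
    return reward_score(text) > threshold
```

In the real pipeline the threshold would be calibrated on held-out human text, since the key finding is that aligned-LLM generations sit above the human reward distribution.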
Key Novelty
Reward Model-based Detection with Preference Tuning
  • Leverages the counter-intuitive finding that aligned LLMs generate text with higher predicted reward scores than human text due to alignment training
  • Uses 'Human/LLM mixed texts' (human text partially rephrased by LLMs) as near-decision boundary samples to help the model learn a sharper distinction between human and machine text
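One plausible way to read the preference-tuning step above is as a pairwise ranking (Bradley-Terry-style) loss that enforces the ordering reward(LLM text) > reward(mixed text) > reward(human text). This is a sketch of that objective, not the paper's exact loss:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ranking_loss(r_llm: float, r_mixed: float, r_human: float) -> float:
    """Pairwise preference loss pushing LLM text above mixed text,
    and mixed text above human text, in reward score."""
    return -(math.log(sigmoid(r_llm - r_mixed))
             + math.log(sigmoid(r_mixed - r_human)))
```

The mixed texts act as near-decision-boundary samples: placing them between the two classes forces the reward model to separate human and LLM text by a wider margin.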
Evaluation Highlights
  • Achieves 97.9% AUROC on detecting GPT-4 generated text, outperforming the prior state-of-the-art (Fast-DetectGPT) by 7.3 percentage points
  • Surpasses the commercial detector GPTZero by roughly 10 percentage points in average AUROC (95.8% vs 85.9%) across multiple aligned LLMs
  • Demonstrates robust generalization, improving detection on Claude 3 Opus from 92.6% (Fast-DetectGPT) to 98.6%
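The AUROC figures above have a simple rank-based interpretation: the probability that a randomly chosen LLM-generated text receives a higher detection score than a randomly chosen human text. A self-contained sketch of that computation:

```python
def auroc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs where the
    positive (LLM) score outranks the negative (human) score; ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

Under this reading, 97.9% AUROC means the reward-based score ranks an LLM-generated passage above a human one in roughly 98 of 100 random pairings.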
Breakthrough Assessment
8/10
Offers a clever, theoretically grounded insight (alignment causes 'super-human' reward scores) that simplifies detection into a single forward pass while achieving SOTA results.