Evaluation Setup
Models are evaluated on the RAGTruth test set across question answering (QA), data-to-text, and summarization tasks.
Benchmarks:
- RAGTruth (hallucination detection at example level and span level)
Metrics:
- F1 Score
- Precision
- Recall
- Statistical methodology: Not explicitly reported in the paper
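As a rough illustration of how the three metrics above relate at the two granularities, here is a minimal Python sketch. The function names are hypothetical, and scoring span-level detection by character overlap is an assumption made for the example; this summary does not state how the paper matches predicted spans to gold spans.

```python
Span = tuple[int, int]  # half-open character range [start, end)


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def example_level_counts(gold: list[bool], pred: list[bool]) -> tuple[int, int, int]:
    """Count TP/FP/FN where each item is one answer labeled hallucinated or not."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    return tp, fp, fn


def span_level_counts(gold_spans: list[Span], pred_spans: list[Span]) -> tuple[int, int, int]:
    """Count TP/FP/FN per character (assumed scoring scheme, see lead-in)."""
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    tp = len(gold_chars & pred_chars)
    fp = len(pred_chars - gold_chars)
    fn = len(gold_chars - pred_chars)
    return tp, fp, fn


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    print(prf(*example_level_counts([True, True, False, False],
                                    [True, False, True, False])))
    print(prf(*span_level_counts([(0, 10)], [(5, 15)])))
```

Under this scheme, a prediction that overlaps a gold span only partially earns partial credit, which is one reason span-level scores tend to sit well below example-level scores.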
Key Results
LettuceDetect Large outperforms most baselines on example-level detection, including GPT-4 and previous SOTA encoders.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RAGTruth | F1 Score | 65.4 | 79.22 | +13.82 |
| RAGTruth | F1 Score | 63.4 | 79.22 | +15.82 |
| RAGTruth | F1 Score | 78.7 | 79.22 | +0.52 |
| RAGTruth | F1 Score | 83.9 | 79.22 | -4.68 |

On span-level detection (locating the specific hallucinated text), LettuceDetect sets a new state-of-the-art.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RAGTruth | F1 Score | 52.7 | 58.93 | +6.23 |

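
The Δ column is the reported score minus the baseline score, in absolute F1 points. A short Python sketch that recomputes it from the figures in the results table (the `delta` helper and the row layout are illustrative, not from the paper):

```python
def delta(baseline: float, this_paper: float) -> float:
    """Absolute difference in F1 points, rounded to two decimals."""
    return round(this_paper - baseline, 2)


# (baseline F1, this paper's F1) pairs copied from the results table.
example_level = [(65.4, 79.22), (63.4, 79.22), (78.7, 79.22), (83.9, 79.22)]
span_level = [(52.7, 58.93)]

if __name__ == "__main__":
    for baseline, ours in example_level + span_level:
        print(f"{baseline:>5} -> {ours:>5}  delta = {delta(baseline, ours):+.2f}")
```

A negative value means the baseline scored higher, as in the fourth example-level row.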
Main Takeaways
- Specialized encoder models (LettuceDetect) can outperform generalist LLMs such as GPT-4 by a wide margin on hallucination detection.
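- Long-context capability in encoders is crucial for RAG verification, because the retrieved context and the generated answer must fit in one input; standard BERT's 512-token limit constrains performance.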
- The framework processes 30-60 examples per second, a large efficiency gain over LLM-based judges that makes real-time checking feasible.
- Llama-3-8B (RAG-HAT) scores higher on example-level classification (83.9 vs. 79.22 F1), but LettuceDetect offers a better trade-off for latency-sensitive applications.
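
To make the latency argument concrete, the reported throughput of 30-60 examples/sec can be converted into an approximate per-example latency. The sketch below assumes examples are processed one at a time (with batching, per-example latency and throughput are no longer simple reciprocals), and the 50 ms budget is a hypothetical figure chosen only for illustration.

```python
def per_example_latency_ms(examples_per_sec: float) -> float:
    """Approximate latency per example, assuming sequential processing."""
    if examples_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / examples_per_sec


def fits_budget(examples_per_sec: float, budget_ms: float) -> bool:
    """True if the approximate per-example latency is within the budget."""
    return per_example_latency_ms(examples_per_sec) <= budget_ms


if __name__ == "__main__":
    for throughput in (30, 60):  # range reported in the takeaways
        latency = per_example_latency_ms(throughput)
        print(f"{throughput} examples/sec ~ {latency:.1f} ms per example")
    # Hypothetical 50 ms budget for an inline hallucination check.
    print(fits_budget(30, budget_ms=50.0))
```

At the reported range this works out to roughly 17-33 ms per example under the sequential assumption.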