Evaluation Setup
4,900 test runs in which agent pairs (a Primary and a Reviewer) generate and critique a blog post about a non-existent subject.
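To make the Primary/Reviewer setup concrete, here is a minimal sketch of one test run. It is an illustration under assumptions, not the paper's harness: the `Agent` type, the prompts, the stub agents, and the use of "Flipfloppidy" as the made-up topic (inferred only from the benchmark name) are all hypothetical.

```python
import time
from typing import Callable, Dict, Union

# Hypothetical agent interface: a prompt goes in, generated text comes out.
Agent = Callable[[str], str]


def run_pair(primary: Agent, reviewer: Agent, topic: str) -> Dict[str, Union[str, float]]:
    """Run one draft -> critique -> revision exchange and time it."""
    start = time.perf_counter()
    draft = primary(f"Write a short blog post about {topic}.")
    critique = reviewer(f"Review this post for factual problems:\n{draft}")
    revision = primary(
        f"Revise the post using this feedback:\n{critique}\n\nPost:\n{draft}"
    )
    return {
        "draft": draft,
        "critique": critique,
        "revision": revision,
        "seconds": time.perf_counter() - start,
    }


# Stub agents so the sketch runs without any model API (assumed behaviour).
def stub_primary(prompt: str) -> str:
    return f"[primary output] {prompt}"


def stub_reviewer(prompt: str) -> str:
    return "The subject does not appear to exist; the post may be hallucinated."


if __name__ == "__main__":
    result = run_pair(stub_primary, stub_reviewer, "Flipfloppidy")
    print(result["critique"])
    print(f"{result['seconds']:.4f} s")
```

In a real run the two stubs would be replaced by calls to the models under test, and each returned record would be logged for scoring.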
Benchmarks:
- Flipfloppidy Hallucination Test (Factual Verification / Hallucination Detection) [New]
Metrics:
- Hallucination Identification Rate (%)
- Revision Success Rate (%)
- Interaction Time (seconds)
Statistical methodology: Not explicitly reported in the paper
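The three metrics listed above can be aggregated from per-run logs roughly as follows. This is a sketch under assumptions: the `RunRecord` fields are hypothetical, and the paper summary does not spell out exactly how a "successful" revision is judged, so that flag is taken as given.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunRecord:
    # Hypothetical per-run log entry; field names are illustrative only.
    reviewer_flagged_hallucination: bool
    primary_revised_successfully: bool
    interaction_seconds: float


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate run records into the three reported metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    flagged = sum(r.reviewer_flagged_hallucination for r in runs)
    revised = sum(r.primary_revised_successfully for r in runs)
    return {
        "hallucination_identification_rate_pct": 100.0 * flagged / n,
        "revision_success_rate_pct": 100.0 * revised / n,
        "mean_interaction_seconds": mean(r.interaction_seconds for r in runs),
    }


if __name__ == "__main__":
    demo = [
        RunRecord(True, True, 2.0),
        RunRecord(True, False, 4.0),
        RunRecord(False, False, 3.0),
        RunRecord(True, True, 3.0),
    ]
    print(summarize(demo))
```

Both rates here use all runs as the denominator; a stricter definition could count revision success only over runs where a hallucination was flagged.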
Key Results
Advanced models consistently detect hallucinations, whereas smaller models fail significantly.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Flipfloppidy Hallucination Test | Hallucination Identification Rate (%) | 0 | 98 | +98 |
| Flipfloppidy Hallucination Test | Revision Success Rate (%) | 46 | 86 | +40 |
| Flipfloppidy Hallucination Test | Interaction Time (seconds) | 35.0 | 2.22 | -32.78 |
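The Δ column in the results above is the "This Paper" value minus the baseline. For the two rate metrics the difference is in percentage points, and for interaction time a negative Δ means a faster exchange. The short sketch below reproduces that arithmetic from the reported values; the helper names are illustrative, and the ratio is only a ratio of the two reported times.

```python
# Values copied from the Key Results table: (metric, baseline, this_paper).
ROWS = [
    ("Hallucination Identification Rate (%)", 0.0, 98.0),
    ("Revision Success Rate (%)", 46.0, 86.0),
    ("Interaction Time (seconds)", 35.0, 2.22),
]


def deltas(rows):
    """Return metric -> (this_paper - baseline), rounded to two decimals."""
    return {name: round(new - base, 2) for name, base, new in rows}


def time_ratio(baseline_seconds, new_seconds):
    """Return how many times shorter the new interaction time is."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    for name, delta in deltas(ROWS).items():
        print(f"{name}: {delta:+g}")
    print(f"time ratio: {time_ratio(35.0, 2.22):.2f}x")
```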
Main Takeaways
- Model size and sophistication are critical for the 'Reviewer' role; small models (Gemma, Mistral) struggle to identify hallucinations or accept feedback.
- Self-correction is highly effective for advanced models (GPT-4, Llama-3-70b), whose revision success rates often exceed 85%.
- Smaller models such as Llama-3-8b can sometimes successfully critique larger models (GPT-4), functioning as effective 'weak supervisors' in specific contexts.
- Inference speed varies drastically, with Groq-hosted models offering 10x faster interactions than GPT-4 API calls, enabling real-time checking.