Evaluation Setup
The paper benchmarks LLMs as factuality evaluators on the newly constructed FELM dataset.
Benchmarks:
- FELM (Factuality Detection) [New]
Metrics:
- F1 Score
- Accuracy
- Precision
- Recall
- Statistical methodology: Not explicitly reported in the paper
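As a rough illustration of how the four metrics above relate, the following sketch computes them for binary factuality labels. It assumes the positive class is "contains a factual error"; the function name `binary_metrics` and the toy labels are hypothetical and not taken from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Compute precision, recall, F1 and accuracy for binary labels.

    Assumption: True means "contains a factual error" (the positive class).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)

    # Guard every division so empty or degenerate inputs yield 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(gold) if len(gold) else 0.0

    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels, for illustration only.
gold_labels = [True, True, False, False, True]
pred_labels = [True, False, False, True, True]
scores = binary_metrics(gold_labels, pred_labels)
```

Which class counts as positive changes precision, recall, and F1 but not accuracy, so the labeling convention should be stated alongside any reported numbers.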
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FELM | Response-level Error Rate | Not reported in the paper | 31.8% | Not reported in the paper |
| FELM | Inter-Annotator Agreement | Not reported in the paper | 90.7% | Not reported in the paper |
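One plausible way to read the response-level error rate in the table is as the fraction of responses that contain at least one erroneous segment. The sketch below works under that assumption; the aggregation rule, the function name `response_error_rate`, and the toy data are hypothetical and are not confirmed by this summary.

```python
from typing import List


def response_error_rate(segment_labels: List[List[bool]]) -> float:
    """Fraction of responses with at least one erroneous segment.

    segment_labels[i][j] is True when segment j of response i is
    annotated as containing a factual error (assumed convention).
    """
    if not segment_labels:
        return 0.0
    # A response counts as erroneous if any of its segments is flagged.
    flagged = sum(1 for segments in segment_labels if any(segments))
    return flagged / len(segment_labels)


# Hypothetical toy annotations: four responses with varying segment counts.
toy_annotations = [
    [False, False],
    [True, False],
    [False],
    [False, True, True],
]
toy_rate = response_error_rate(toy_annotations)
```

Under this rule a single flagged segment marks the whole response as erroneous, so the response-level rate is always at least as high as the segment-level rate.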
Main Takeaways
- Factuality error detection remains a challenging task for current LLMs (ChatGPT, GPT-4), even when augmented with retrieval or Chain-of-Thought.
- Retrieval mechanisms help improve factuality evaluation but are not a complete solution.
- Claim-based evaluators (extracting atomic facts) are generally more effective than segment-based or response-based evaluators (qualitative finding discussed in text).
- LLMs struggle more in the 'Reasoning' and 'Math' domains than in the standard 'World Knowledge' domain.