Evaluation Setup
Factuality evaluation on LLM-AggreFact; generator improvement on FActScore and TruthfulQA.
Benchmarks:
- LLM-AggreFact (Factuality Judgment; aggregation of 10 datasets)
- FActScore (Biography Generation Factuality)
- TruthfulQA (QA Truthfulness)
Metrics:
- Balanced Accuracy (BAcc)
- FActScore (Factuality Rate)
- TruthfulQA % True
- Statistical methodology: Not explicitly reported in the paper
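To make the metrics above concrete, here is a minimal sketch of how balanced accuracy and a factuality rate are commonly computed. It assumes binary labels (1 = supported, 0 = unsupported) and treats the factuality rate as the fraction of claims judged supported; the function names and toy labels are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def factuality_rate(supported: Sequence[bool]) -> float:
    """Fraction of claims judged supported (assumed definition for this sketch)."""
    if not supported:
        raise ValueError("at least one claim is required")
    return sum(supported) / len(supported)


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))
print(factuality_rate([True, True, False, True]))
```

On these toy labels, balanced accuracy is 0.625 while plain accuracy would be 4/6, which shows why the balanced variant is preferred when the two classes are unevenly represented.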
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Evaluator performance on LLM-AggreFact shows FenCE outperforms much larger models. | | | | |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 73.2 | 81.5 | +8.3 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 80.4 | 81.5 | +1.1 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 79.3 | 81.5 | +2.2 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 78.6 | 81.5 | +2.9 |
| Generator improvement results demonstrating the effectiveness of FenCE-based training. | | | | |
| FActScore | Factuality Rate | 56.41 | 73.27 | +16.86 |
| FActScore | Factuality Rate | 64.44 | 73.27 | +8.83 |
| TruthfulQA | % True | 47.78 | 65.42 | +17.64 |
Main Takeaways
- Augmenting evaluator training data with tool-retrieved documents and textual critiques significantly boosts judgment accuracy, allowing an 8B model to outperform 100B+ proprietary models.
- The 'Self-Knowledge Check' (filtering out facts the model doesn't know) is crucial for factuality training; it prevents the common pitfall where RLHF reinforces hallucination of obscure details.
- FenCE-based training generalizes well, showing improvements across both biography generation (FActScore) and QA truthfulness (TruthfulQA).
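The 'Self-Knowledge Check' in the takeaways above can be pictured as a filter applied before training targets are built. The sketch below is only an illustration of that idea: `self_knowledge_filter`, `model_knows`, and the toy claims are hypothetical names, and this summary does not describe how the paper actually decides whether the model knows a fact.

```python
from typing import Callable, Iterable, List


def self_knowledge_filter(
    claims: Iterable[str],
    model_knows: Callable[[str], bool],
) -> List[str]:
    """Keep only claims the generator is judged to know.

    `model_knows` is a hypothetical predicate; in practice it would wrap
    some probe of the generator, which is not specified in this summary.
    """
    return [claim for claim in claims if model_knows(claim)]


# Toy stand-in for the knowledge probe (an assumption for illustration).
KNOWN_CLAIMS = {"claim the model can verify"}


def toy_model_knows(claim: str) -> bool:
    return claim in KNOWN_CLAIMS


claims = ["claim the model can verify", "obscure claim the model cannot verify"]
kept = self_knowledge_filter(claims, toy_model_knows)
print(kept)
```

Dropping the unknown claim before training means the model is not rewarded for asserting details it cannot back up, which is the pitfall the takeaway describes.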