| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Evaluator performance on LLM-AggreFact shows FenCE outperforms much larger models. | | | | |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 73.2 | 81.5 | +8.3 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 80.4 | 81.5 | +1.1 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 79.3 | 81.5 | +2.2 |
| LLM-AggreFact | Balanced Accuracy (BAcc) | 78.6 | 81.5 | +2.9 |
| Generator improvement results demonstrating the effectiveness of FenCE-based training. | | | | |
| FActScore | Factuality Rate | 56.41 | 73.27 | +16.86 |
| FActScore | Factuality Rate | 64.44 | 73.27 | +8.83 |
| TruthfulQA | % True | 47.78 | 65.42 | +17.64 |
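
For readers unfamiliar with the metric, here is a minimal sketch of how balanced accuracy is usually computed for a binary task, together with the subtraction behind the Δ column (This Paper minus Baseline). It assumes the standard definition, the mean of the recall on each class. The function names `balanced_accuracy` and `delta` and the toy labels are illustrative assumptions, not taken from the paper or from any benchmark's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = positive, 0 = negative).

    Assumes the standard definition: (TPR + TNR) / 2.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


def delta(baseline, ours, ndigits=2):
    """Improvement as reported in the table: ours minus baseline."""
    return round(ours - baseline, ndigits)


# Toy labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)  # TPR = 3/4, TNR = 1/2

# Recompute two Δ entries from the table above.
bacc_gain = delta(73.2, 81.5)
factscore_gain = delta(56.41, 73.27)
```

Unlike plain accuracy, this average weights both classes equally, so a model cannot score well by always predicting the majority label.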