| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Correlation analysis on the synthetic dataset indicates that FactLens automated evaluators align moderately with human judgments (Pearson r between 0.27 and 0.60); agreement is lowest for 'Sufficiency', where subjectivity reduces it. | ||||
| Synthetic Dataset | Pearson Correlation (Atomicity) | 0 | 0.45 | +0.45 |
| Synthetic Dataset | Pearson Correlation (Coverage) | 0 | 0.60 | +0.60 |
| Synthetic Dataset | Pearson Correlation (Sufficiency) | 0 | 0.27 | +0.27 |
| On decomposition, models score high on sufficiency and coverage but struggle with atomicity (1.89 on the 1-3 scale). | ||||
| FactLens (CoverBench subset) | Atomicity Score (1-3) | 3.00 | 1.89 | -1.11 |
| FactLens (CoverBench subset) | Sufficiency Score (1-3) | 3.00 | 2.98 | -0.02 |
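
The Pearson correlations in the table compare automated evaluator scores with human judgments on the same items. The following is a minimal sketch of that computation, not the paper's evaluation code: the `pearson` helper and both rating lists are hypothetical and chosen only to illustrate the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-3 ratings, made up for illustration; not data from the paper.
human_scores = [3, 2, 1, 3, 2, 1]
auto_scores = [3, 2, 2, 3, 1, 1]

r = pearson(human_scores, auto_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would mean the automated evaluator ranks items much as the human annotators do, while a value near 0 would mean no linear relationship.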