Evaluation Setup
Compares the claims that each method extracts from 396 BingCheck answers, all of which were generated by Microsoft Copilot.
Benchmarks:
- BingCheck (Long-form Question Answering)
Metrics:
- Entailment %
- Sentence-Level Coverage (Accuracy/F1)
- Element-Level Coverage (Accuracy/F1)
- Decontextualization (Percentage of 'Desirable' outcomes)
Statistical methodology:
- Two-proportion Z-tests with Holm-Bonferroni correction
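As a rough illustration of the statistical methodology named above, the following sketch runs a pooled two-proportion Z-test and then applies the Holm-Bonferroni step-down procedure to a family of p-values. The function names, the `alpha` default, and the example counts are all hypothetical; the source does not give the per-method counts or the exact test configuration, so this is a minimal sketch rather than the authors' analysis code.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: p_a == p_b, using the pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all successes or all failures: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value (in the original order): True if H0 is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Step-down: once one test fails, all larger p-values fail too.
    return reject


if __name__ == "__main__":
    # Made-up counts for illustration only; they are NOT taken from the paper.
    comparisons = [(90, 100, 70, 100), (85, 100, 80, 100), (60, 100, 58, 100)]
    p_values = [two_proportion_z_test(*c)[1] for c in comparisons]
    print(holm_bonferroni(p_values))
```

Holm-Bonferroni controls the family-wise error rate like the plain Bonferroni correction but is uniformly more powerful, which matters when several baselines are compared against one method.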
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Entailment results measure if the extracted claims are supported by the source text. |
| BingCheck | Entailment % | 89.1 | 99.0 | +9.9 |
| Coverage results measure how well verifiable content is captured (Recall) and unverifiable content excluded (Precision). |
| BingCheck | Sentence-Level Accuracy | 81.6 | 91.8 | +10.2 |
| BingCheck | Element-Level Macro F1 | 56.2 | 83.7 | +27.5 |
| Decontextualization results measure if the claim is self-contained enough to produce the same verification verdict as the fully contextualized version. |
| BingCheck | % Desirable Results (Google Search) | 78.4 | 80.6 | +2.2 |
| BingCheck | % Desirable Results (Bing Search) | 79.3 | 80.5 | +1.2 |
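The coverage rows report accuracy and macro F1. For readers less familiar with macro averaging, the sketch below computes both from parallel lists of gold and predicted labels: macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The binary labels (1 = covered, 0 = not covered) and the function names are assumptions made for illustration; the source does not spell out the label scheme.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are zero or undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels: 1 = element covered by a claim, 0 = not covered.
    gold = [1, 1, 0, 0]
    pred = [1, 0, 0, 0]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```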
Main Takeaways
- Claimify consistently achieves the best balance of high entailment and high coverage, avoiding the trade-off seen in baselines (e.g., VeriScore has high entailment but poor coverage).
- The 'Selection' stage is critical; removing it drops element-level coverage F1 from 83.7% to 54.4% (a 29.3-point loss), showing the importance of pre-filtering unverifiable content.
- Current NLI models are insufficient for evaluating claim entailment; a custom LLM prompt agrees much more closely with human judgment.
- Outcome-based decontextualization evaluation reveals that seemingly 'decontextualized' claims often fail to retrieve the correct evidence unless rigorously checked against a max-context version.
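The outcome-based decontextualization check in the last takeaway can be sketched as a comparison of verdicts: each extracted claim and its max-context counterpart are verified separately, and the result counts as desirable when both reach the same verdict. This is a simplification that assumes 'desirable' means plain verdict agreement; the paper may apply a finer-grained rubric. The verdict strings and the function name are hypothetical.

```python
def desirable_rate(claim_verdicts, max_context_verdicts):
    """Percentage of claims whose verdict matches that of their max-context version."""
    if len(claim_verdicts) != len(max_context_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not claim_verdicts:
        raise ValueError("at least one claim is required")
    matches = sum(
        1 for claim, full in zip(claim_verdicts, max_context_verdicts) if claim == full
    )
    return 100.0 * matches / len(claim_verdicts)


if __name__ == "__main__":
    # Hypothetical verdicts, e.g. from a search-backed fact-checking step.
    extracted = ["supported", "refuted", "supported", "inconclusive"]
    max_context = ["supported", "refuted", "refuted", "inconclusive"]
    print(desirable_rate(extracted, max_context))
```

Measuring the outcome rather than the surface form of a claim is what lets this check catch claims that read as self-contained but still retrieve different evidence.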