Evaluation Setup
Historical replay: Run agent on article state at time T, restricted to sources [T, T+dt]. Compare output to actual human edits made in [T, T+dt].
Benchmarks:
- Wikipedia 2024 Edits (Knowledge Update) [New]
- Editor Test Set (Text Editing) [New]
Metrics:
- Hard Coverage (Chard): Fact matches edit in correct section.
- Soft Coverage (Csoft): Fact matches edit anywhere in article.
- Section Accuracy (SAcc)
- Key Facts Coverage (Editor)
- Commentary Information Coverage (Editor)
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| Editor Test Set |
Key Facts Coverage |
91.3 |
91.7 |
+0.4
|
| Wikipedia 2024 Edits |
Soft Coverage (Csoft) |
21.5% |
34.4% |
+12.9%
|
| Wikipedia 2024 Edits |
Hard Coverage (Chard) |
30.6% |
15.4% |
-15.2%
|
| User Study |
Acceptance Rate |
N/A |
68% (No Revision) + 29% (With Revision) |
N/A
|
Main Takeaways
- Fine-tuning on human edit history produces editors that are more neutral and concise than generic large models like GPT-4o.
- Agentic, iterative search significantly improves the discovery of relevant updates compared to single-shot queries.
- The 'Hard Coverage' gap suggests that determining *where* to place an update is as challenging as finding the update itself.
- WINELL can effectively function as a 'human-in-the-loop' assistant, with high acceptance rates for its suggestions.