WINELL: Wikipedia Never-Ending Updating with LLM Agents

📝 Paper Summary

Multi-agent, Web agents, Agent evolution

WINELL is an agentic framework that autonomously updates Wikipedia articles by inducing section-specific criteria, iteratively aggregating online information via multi-agent search, and generating precise edits using a model fine-tuned on historical human edits.

Core Problem

Wikipedia relies on manual edits, causing significant latency between the publication of new information and its incorporation into articles, especially for less popular pages.

Why it matters:

Manual maintenance cannot keep up with the vast scale of evolving real-world information.
Existing automated methods focus on infoboxes or assume facts are already given, rather than autonomously discovering and integrating full textual updates.

Key Novelty

An end-to-end agentic loop for 'Never-Ending' Wikipedia updating that combines structural analysis, iterative web search, and human-mimicking fine-grained editing.

Induces 'Section Criteria' from the article structure to guide relevance.
Uses a multi-agent 'Navigator-Extractor-Aggregator' loop to find and de-duplicate updates.
Fine-tunes a specific 'Editor' model on historical human edits to replicate the style and neutrality of Wikipedia contributors.
Introduces an automatic evaluation methodology using historical human edits as ground truth.

Architecture

Overview of WINELL pipeline: Article -> Section Criteria -> Agentic Update Aggregation (Loop) -> Fine-Grained Editor -> Updated Article.

Evaluation Highlights

Fine-grained Editor: The fine-tuned Llama-3.1-8B-Editor achieves 91.7% Key Facts Coverage with only 18.7% Commentary (noise) retention, outperforming GPT-4o (53.1% commentary retention).
End-to-End Coverage: Measured against the facts in historical human edits, WINELL achieves a Soft Coverage of 34.4% (a fact counts as covered if the agent adds it anywhere in the article).
Ablation Impact: Removing the agentic search component drops Soft Coverage from 34.4% to 21.5%, showing the value of iterative information seeking.
Human Evaluation: 68% of WINELL's suggested edits were accepted by experienced Wikipedia editors without revision.

Breakthrough Assessment

7/10

WINELL is a significant step towards autonomous knowledge base maintenance using agents. The evaluation methodology (replaying historical updates) is clever and rigorous. Hard coverage of 15.4% leaves room for improvement in placing updates in the right section, but the framework successfully modernizes the NELL (Never-Ending Language Learning) concept.

⚙️ Technical Details

Pipeline Flow

Input: Wikipedia Article W
Section Criteria Induction: LLM analyzes article structure to define what content belongs in each section.
Agentic Update Aggregation: Iterative web search loop to find, extract, and aggregate new facts.
Fine-Grained Editing: A specialized model integrates the aggregated updates into the specific text sections.
Output: Edit suggestions (additions/modifications) for human review.
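
The flow above can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: the `Article` mapping, the `EditSuggestion` type, and the three callables (`induce_criteria`, `aggregate_updates`, `edit_section`) are hypothetical stand-ins for the LLM-backed modules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: section title -> section text.
Article = Dict[str, str]


@dataclass
class EditSuggestion:
    section: str       # section the suggestion applies to
    revised_text: str  # proposed text, pending human review


def run_pipeline(
    article: Article,
    induce_criteria: Callable[[Article], Dict[str, str]],
    aggregate_updates: Callable[[Article, Dict[str, str]], Dict[str, List[str]]],
    edit_section: Callable[[str, List[str]], str],
) -> List[EditSuggestion]:
    """Chain the three stages; each stage is passed in as a callable."""
    # Stage 1: derive a content criterion for every section.
    criteria = induce_criteria(article)
    # Stage 2: collect new facts, grouped by target section.
    updates = aggregate_updates(article, criteria)
    # Stage 3: integrate the facts into each affected section.
    suggestions = []
    for section, facts in updates.items():
        if not facts or section not in article:
            continue  # nothing to integrate, or unknown section
        revised = edit_section(article[section], facts)
        suggestions.append(EditSuggestion(section, revised))
    return suggestions
```

Passing the stages as callables keeps the sketch independent of any particular model or search backend; the output is a list of suggestions rather than an edited article, matching the human-review step described above.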

System Modules

Section Criteria Inductor

Define content policy for each section.

Model or implementation: GPT-4.1

Agentic Update Aggregator (Navigator, Extractor, Aggregator)

Search web, extract facts, filter redundancy.

Model or implementation: GPT-4.1 mini + Google Search API

Fine-Grained Editor

Integrate updates into paragraph text.

Model or implementation: Fine-tuned Llama-3.1-8B or Qwen2.5-7B
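
To make the Navigator, Extractor, and Aggregator roles concrete, here is a rough sketch of one possible iterative loop. Several parts are assumptions rather than details from the paper: the `navigate` and `extract` callables stand in for the LLM and search calls, `max_rounds` is a made-up stopping parameter, and redundancy filtering is reduced to normalized string comparison, where the real system presumably uses model judgment.

```python
from typing import Callable, Iterable, List


def normalize(fact: str) -> str:
    # Crude stand-in for redundancy detection: ignore case and extra whitespace.
    return " ".join(fact.lower().split())


def aggregate(
    navigate: Callable[[List[str]], List[str]],
    extract: Callable[[str], Iterable[str]],
    known_facts: Iterable[str],
    max_rounds: int = 3,
) -> List[str]:
    """Iteratively search, extract, and keep only facts not seen before."""
    seen = {normalize(f) for f in known_facts}  # facts already in the article
    collected: List[str] = []
    for _ in range(max_rounds):
        # Navigator: propose queries given what has been collected so far.
        queries = navigate(collected)
        if not queries:
            break
        found_new = False
        for query in queries:
            # Extractor: pull candidate facts from the results for this query.
            for fact in extract(query):
                key = normalize(fact)
                # Aggregator: drop anything redundant with known or collected facts.
                if key not in seen:
                    seen.add(key)
                    collected.append(fact)
                    found_new = True
        if not found_new:
            break  # a round with no new facts ends the loop early
    return collected
```

The early exit on a round with no new facts is one simple way to bound search cost; other stopping rules would work equally well in this sketch.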

📊 Experiments & Results

Evaluation Setup

Historical replay: Run the agent on the article as it stood at time T, restricting it to sources published in [T, T+dt]. Compare its output to the human edits actually made in [T, T+dt].
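
The source restriction in this setup amounts to a closed-interval date filter. A minimal sketch, assuming each source carries a publication timestamp (the `Source` type and `sources_in_window` helper are hypothetical names used only for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Source:
    url: str
    published: datetime


def sources_in_window(sources: Iterable[Source], t: datetime, dt: timedelta) -> List[Source]:
    """Keep only sources published in the closed interval [t, t + dt]."""
    end = t + dt
    return [s for s in sources if t <= s.published <= end]
```

Filtering on publication time keeps the replay from seeing information that was already available before the article snapshot or that only appeared after the comparison window closed.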

Benchmarks:

Wikipedia 2024 Edits (Knowledge Update) [New]
Editor Test Set (Text Editing) [New]

Metrics:

Hard Coverage (C_hard): A fact from a human edit is matched by an agent edit in the correct section.
Soft Coverage (C_soft): A fact from a human edit is matched by an agent edit anywhere in the article.
Section Accuracy (SAcc)
Key Facts Coverage (Editor)
Commentary Information Coverage (Editor)
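
The difference between hard and soft coverage can be illustrated with a toy computation. This sketch assumes facts are compared by exact string equality and that each fact is tagged with its section; the paper's actual fact matching is likely more lenient, and the `coverage` function name is hypothetical.

```python
from typing import List, Tuple

# A fact is represented as (section title, fact text).
Fact = Tuple[str, str]


def coverage(human_facts: List[Fact], agent_facts: List[Fact]) -> Tuple[float, float]:
    """Return (hard, soft) coverage of human facts by agent facts."""
    if not human_facts:
        return 0.0, 0.0
    agent_pairs = set(agent_facts)               # (section, text) pairs
    agent_texts = {text for _, text in agent_facts}  # text only, any section
    # Hard: same fact text AND same section.
    hard = sum(1 for fact in human_facts if fact in agent_pairs)
    # Soft: same fact text, section ignored.
    soft = sum(1 for _, text in human_facts if text in agent_texts)
    n = len(human_facts)
    return hard / n, soft / n
```

By construction every hard match is also a soft match, so hard coverage can never exceed soft coverage for the same system; the gap between the two reflects facts that were found but placed in a different section.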

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Editor Test Set	Key Facts Coverage	91.3%	91.7%	+0.4 pp
Wikipedia 2024 Edits	Soft Coverage (C_soft)	21.5%	34.4%	+12.9 pp
Wikipedia 2024 Edits	Hard Coverage (C_hard)	30.6%	15.4%	-15.2 pp
User Study	Acceptance Rate	N/A	68% (No Revision) + 29% (With Revision)	N/A

Experiment Figures

Motivation: Human edits often lag months behind source publication.

Automatic evaluation setup comparing agent edits to atomic facts in human edits.

Performance by page category: Sports/Organizations are easier to update than Politicians/Celebrities.

Main Takeaways

Fine-tuning on human edit history produces editors that are more neutral and concise than generic large models like GPT-4o.
Agentic, iterative search significantly improves the discovery of relevant updates compared to single-shot queries.
The 'Hard Coverage' gap suggests that determining *where* to place an update is as challenging as finding the update itself.
WINELL can effectively function as a 'human-in-the-loop' assistant, with high acceptance rates for its suggestions.