Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews

📝 Paper Summary

LLM-assisted scientific workflow, Automated literature review, Human-AI collaboration

LLM-based agents, particularly when using a consensus of high-performing models, can filter thousands of academic papers for systematic reviews with >98% recall in minutes rather than weeks.

Core Problem

Systematic literature reviews (SLRs) require manually screening thousands of papers based on titles and abstracts to identify relevant studies, a process that is labor-intensive, slow, and prone to fatigue.

Why it matters:

Screening 8,000+ papers takes ~66 person-hours of uninterrupted work, often stretching to months in practice due to fatigue and other commitments
Manual filtering suffers from inconsistency and human error, with reported error rates ranging from 0.5% to 9%
The high cost of entry discourages broad exploratory surveys and cross-disciplinary analyses

Concrete Example: In the authors' case study of immersive visual network analysis, researchers had to screen 8,323 candidates. Manually, this took weeks. An individual LLM (Llama-3-8B) flagged 774 false positives, while a human reviewer might miss edge cases due to fatigue.

Key Novelty

LLMSurver: Visual-Interactive Consensus Filtration

A structured pipeline where multiple LLM agents independently classify papers as 'Include' or 'Discard' based on title/abstract using detailed prompt criteria
A consensus voting mechanism (e.g., 'Consensus Best') that combines outputs from the top models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash) to maintain high recall while drastically reducing false positives
A visual-interactive interface allowing researchers to iteratively refine prompts, inspect agent justifications, and resolve disagreements
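
The disagreement-resolution step can be illustrated with a small sketch. It assumes each paper's votes are stored as a mapping from model name to decision; the function name `find_disagreements` and the paper IDs are hypothetical and not taken from LLMSurver.

```python
def find_disagreements(votes_by_paper: dict[str, dict[str, str]]) -> list[str]:
    """Return IDs of papers on which the agents did not all give the same decision."""
    return [
        paper_id
        for paper_id, votes in votes_by_paper.items()
        if len(set(votes.values())) > 1
    ]


# Hypothetical votes from three agents on two papers.
votes = {
    "paper-001": {"gpt-4o": "INCLUDE", "claude": "INCLUDE", "gemini": "INCLUDE"},
    "paper-002": {"gpt-4o": "DISCARD", "claude": "INCLUDE", "gemini": "DISCARD"},
}
to_review = find_disagreements(votes)  # only paper-002 has a split vote
```

Papers returned by such a check are the natural candidates for a researcher to inspect alongside the agents' justifications.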

Architecture

Screenshot of the LLMSurver application interface showing the dashboard layout.

Evaluation Highlights

Consensus (Best) achieved 98.86% recall and 97.99% accuracy on a dataset of 8,323 papers, missing only 1 out of 88 relevant papers.
GPT-4o processed the entire corpus in under 10 minutes for $28.81, compared to an estimated 66+ hours of human labor.
Consensus voting reduced false positives from 774 (individual Llama-3-8B) to 167, leaving 254 papers (87 TP + 167 FP), roughly 3% of the 8,323 raw search results, for manual validation.

Breakthrough Assessment

7/10

Strong practical application demonstrating that off-the-shelf LLMs can replace weeks of manual labor with high reliability. While the underlying ML technique is standard prompting, the system integration and rigorous validation against ground truth are valuable.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of academic papers into 'Include' or 'Discard' categories for a systematic review based on metadata

Inputs: Paper title and abstract

Outputs: Binary decision (INCLUDE/DISCARD) and a 2-sentence textual justification
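
A minimal sketch of how one agent's output could be represented. It assumes, purely for illustration, that the model replies with the decision on the first line and the justification afterwards; `Verdict` and `parse_verdict` are hypothetical names, not part of LLMSurver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    decision: str       # "INCLUDE" or "DISCARD"
    justification: str  # short free-text rationale returned by the model


def parse_verdict(reply: str) -> Verdict:
    """Parse a reply whose first line is the decision (assumed format)."""
    first_line, _, rest = reply.strip().partition("\n")
    decision = first_line.strip().upper()
    if decision not in {"INCLUDE", "DISCARD"}:
        raise ValueError(f"unexpected decision: {first_line!r}")
    return Verdict(decision, rest.strip())


verdict = parse_verdict("include\nStudies graph analysis in VR. Matches the criteria.")
```

Rejecting any reply that is not one of the two labels keeps a malformed answer from silently counting as a vote.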

Pipeline Flow

Repository Search (retrieve candidate papers)
Preprocessing (deduplication, metadata unification)
Multi-LLM Classification (independent agents)
Consensus Voting (aggregation)
Human Refinement (visual inspection)
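
As an illustration of the preprocessing step, the sketch below deduplicates records by normalized title. The summary does not say how the authors deduplicate, so the matching rule and the record layout (a dict with a `title` key) are assumptions.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized title."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        key = normalize_title(paper["title"])
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


# Hypothetical records as they might arrive from two different libraries.
candidates = [
    {"title": "Graph Vis in VR", "source": "ACM"},
    {"title": "graph vis in VR.", "source": "IEEE"},
    {"title": "Immersive Network Layouts", "source": "IEEE"},
]
unique_papers = deduplicate(candidates)
```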

System Modules

Repository Search

Fetch candidate papers from digital libraries

Model or implementation: Keyword-based search engines (ACM, IEEE, Eurographics)

LLM Classifiers (Filtration)

Classify each paper independently based on title and abstract

Model or implementation: Various (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama-3)
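
A zero-shot screening prompt for one paper might be assembled as follows. The wording is illustrative only and is not the authors' prompt (their prompts are described in the paper); `build_prompt` is a hypothetical helper.

```python
def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Assemble a zero-shot screening prompt (illustrative wording, not the paper's)."""
    return (
        "You are screening papers for a systematic literature review.\n"
        f"Inclusion and exclusion criteria:\n{criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with INCLUDE or DISCARD on the first line, "
        "followed by a two-sentence justification."
    )


prompt = build_prompt(
    criteria="Must address network visualization in an immersive environment.",
    title="Graphs in VR",
    abstract="We study node-link diagrams in virtual reality.",
)
```

The same prompt text can be sent unchanged to each model so that the agents differ only in the model used.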

Consensus Aggregator (Filtration)

Combine individual LLM votes into a final decision

Model or implementation: Logic-based voter (e.g., Unanimity to Discard)
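
One plausible reading of 'Unanimity to Discard' is that a paper is discarded only when every agent votes DISCARD, and included otherwise. The sketch below implements that reading; the exact rule used by LLMSurver may differ, and `consensus` is a hypothetical name.

```python
def consensus(votes: list[str]) -> str:
    """Discard only if every agent voted DISCARD; otherwise keep the paper."""
    if not votes:
        raise ValueError("at least one vote is required")
    return "DISCARD" if all(v == "DISCARD" for v in votes) else "INCLUDE"


split_vote = consensus(["DISCARD", "INCLUDE", "DISCARD"])    # "INCLUDE"
unanimous_vote = consensus(["DISCARD", "DISCARD", "DISCARD"])  # "DISCARD"
```

Under this rule a single INCLUDE vote is enough to keep a paper, which favors recall at the cost of some extra false positives.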

Novel Architectural Elements

Visual-interactive feedback loop allowing users to refine inclusion/exclusion prompts based on immediate LLM justifications before processing the full corpus

Modeling

Base Model: Evaluated multiple models: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama-3 (8B/70B)

Training Method: Zero-shot prompting with consensus aggregation

Adaptation: None (Prompt engineering only)

Trainable Parameters: 0 (Inference only)

Compute: GPT-4o processing 8.3k papers took <10 minutes ($28.81). Local Llama-3 8B is feasible on consumer hardware.
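
The reported figures imply the following per-paper cost and speed-up. This is back-of-the-envelope arithmetic on the numbers quoted in this summary, using the 10-minute upper bound for GPT-4o.

```python
PAPERS = 8323
GPT4O_COST_USD = 28.81
GPT4O_MINUTES = 10   # reported as "under 10 minutes", so this is an upper bound
HUMAN_HOURS = 66     # estimated uninterrupted manual screening time

cost_per_paper = GPT4O_COST_USD / PAPERS                   # about $0.0035 per paper
human_seconds_per_paper = HUMAN_HOURS * 3600 / PAPERS      # about 28.5 s per paper
speedup = HUMAN_HOURS * 60 / GPT4O_MINUTES                 # at least 396x
```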

Comparison to Prior Work

vs. Manual Screening: Orders of magnitude faster (<10 mins vs 66 hours) with comparable or better recall
vs. Single LLM approaches (Haryanto 2024): Uses consensus of multiple models to significantly reduce false positives while maintaining recall [cited in paper]
vs. Gehrmann et al. [cited in paper]: Adds visual-interactive element for prompt refinement and consensus building

Limitations

Evaluated on a single large corpus (8.3k papers) from one specific domain (CS/Visualization); generalization is unproven
Relies on title/abstract only; full-text analysis is not performed
Risk of hallucinations or biases in commercial models (though constrained by classification schema)
No controlled user study was conducted to measure the human experience of the tool

Reproducibility

Code: https://github.com/dbvis-ukon/LLMSurver

Code is publicly available in the repository linked above. The dataset of 8,323 papers and their ground-truth labels is derived from a specific survey ('Visual Network Analysis in Immersive Environments'); whether the raw data is released depends on the repository contents. Prompts are described in the paper.

📊 Experiments & Results

Evaluation Setup

Re-creation of a real-world Systematic Literature Review (SLR) on 'Visual Network Analysis in Immersive Environments'

Benchmarks:

Custom SLR Corpus (binary classification: Include vs. Discard) [New]

Metrics:

Recall (sensitivity)
Precision
Accuracy
F1 Score
Statistical methodology: Comparison against human-generated ground truth (88 included, 8235 discarded)
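
The metrics above follow from the confusion-matrix counts. The sketch below plugs in the 'Consensus (Best)' counts quoted in this summary (87 TP, 167 FP, 1 FN); the TN count is derived here as 8,235 ground-truth discards minus the 167 false positives, which is an assumption rather than a figure from the paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the four reported metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# TN is derived (8235 - 167), not reported directly.
m = classification_metrics(tp=87, fp=167, fn=1, tn=8235 - 167)
recall_pct = round(m["recall"] * 100, 2)        # 98.86
precision_pct = round(m["precision"] * 100, 2)  # 34.25
```

Because only 88 of 8,323 papers are relevant, accuracy is dominated by the true negatives; recall and precision say more about screening quality.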

Key Results

Individual model performance shows a trade-off: larger commercial models have higher precision, while smaller open models (Llama-3-8B) are more conservative (high recall, low precision).

Benchmark	Metric	Baseline	This Paper	Δ
Custom SLR Corpus	Recall (%)	98.86	98.86	0.00
Custom SLR Corpus	Precision (%)	10.10	34.25	+24.15
Custom SLR Corpus	False Positives (count)	774	167	-607
Custom SLR Corpus	False Negatives (count)	5	1	-4

Experiment Figures

Venn-style diagrams or bar charts showing the overlap of False Positives (FP) and False Negatives (FN) across different LLMs.

Comparison of False Positives (FP) across Consensus (All) vs Consensus (Best).

Main Takeaways

Consensus voting (specifically 'Consensus Best' using GPT-4o, Claude, Gemini) offers the best trade-off, maintaining near-perfect recall (>98%) while significantly boosting precision compared to individual open-source models.
Llama-3 8B is a viable 'safe' local option: it has high recall (misses almost nothing) but requires more manual work to filter its many false positives.
The 'Consensus (Best)' approach reduces the manual review pile from 8,323 papers to just 254 (87 TP + 167 FP), making the task manageable in hours rather than weeks.
The single paper missed by the best models was an edge case that even human reviewers initially struggled with.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Systematic Literature Review (SLR) processes (PRISMA)
Basic familiarity with Large Language Models (LLMs) and prompting

Key Terms

SLR: Systematic Literature Review—a research method that rigorously identifies, selects, and appraises all relevant research on a specific topic

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses—a standard guideline for reporting systematic reviews

TP/FP/TN/FN: True Positive, False Positive, True Negative, False Negative; the four confusion-matrix outcomes from which recall, precision, and accuracy are computed

Recall: The percentage of relevant papers successfully found by the model (crucial for SLRs to avoid missing work)

Precision: The percentage of papers flagged by the model that are actually relevant (important for reducing human workload)

Consensus Voting: A strategy where the final decision is determined by the agreement of multiple different LLM models

Zero-shot learning: The model performs the task without seeing any specific training examples, relying only on the prompt instructions

RAG: Retrieval-Augmented Generation—enhancing LLMs by retrieving relevant external data (mentioned as future work)

F1 score: Harmonic mean of precision and recall, used here to select the best models for consensus

Snowballing: A search technique where references of included papers are recursively checked to find more relevant papers
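
Snowballing as defined above amounts to a breadth-first traversal of the reference graph. The sketch below assumes a hypothetical `references` mapping from a paper ID to the IDs it cites and a caller-supplied `is_relevant` check; neither comes from the paper.

```python
from collections import deque
from typing import Callable


def snowball(
    seeds: list[str],
    references: dict[str, list[str]],
    is_relevant: Callable[[str], bool],
) -> set[str]:
    """Follow references of included papers, adding each relevant one exactly once."""
    included = set(seeds)
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        for ref in references.get(paper, []):
            if ref not in included and is_relevant(ref):
                included.add(ref)
                queue.append(ref)  # its references are checked in turn
    return included


# Hypothetical citation graph: C is judged irrelevant, so E is never reached.
refs = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
found = snowball(["A"], refs, is_relevant=lambda p: p != "C")
```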