OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases

📝 Paper Summary

Reasoning benchmarks; Structured knowledge integration

OneEval is a benchmark that assesses LLM reasoning across four knowledge modalities (text, knowledge graphs, code, formal logic) and five domains. It reveals severe performance degradation as the knowledge becomes more structured.

Core Problem

Existing benchmarks focus predominantly on unstructured textual reasoning, failing to evaluate how LLMs handle structured external knowledge like knowledge graphs, code, or formal logic.

Why it matters:

Real-world applications often require integrating structured data (e.g., databases, formal specs), not just narrative text.
Current LLMs, even powerful ones like DeepSeek-R1 and Grok3, show significant fragility when transitioning from text to structured reasoning.
The lack of diverse modality benchmarks masks blind spots in model capabilities regarding symbolic manipulation and formal logic.

Concrete Example: Averaged across models, accuracy is 53% on textual reasoning but falls to 25% on formal logic tasks within the same benchmark suite, which suggests that models handle natural language patterns better than explicit structural constraints.

Key Novelty

Multi-Modality Structured Knowledge Benchmark

Unifies evaluation across four distinct knowledge base types (Text, Knowledge Graph, Code, Logic) rather than testing them in isolation.
Introduces a 'Hard' subset specifically curated through empirical failure rates and expert review to prevent saturation and test limits.
Systematically analyzes the correlation between 'knowledge structuredness' (text → logic) and reasoning performance decline.

Architecture

The OneEval framework structure, illustrating the 4 knowledge bases, 5 domains, and the evaluation pipeline.

Evaluation Highlights

Accuracy drops sharply as structure increases: 53% on Textual Reasoning vs. 25% on Formal Logic (average across models).
Even the strongest model (o3) achieves only 32.2% accuracy on the OneEval-Hard subset, leaving roughly two thirds of the hard samples unsolved.
Reasoning-focused models (R-LLMs) outperform standard models by ~5.6 points on hard structured tasks, showing better resilience to complexity.

Breakthrough Assessment

8/10

A critical, comprehensive benchmark that exposes the 'structure gap' in current LLMs. It moves beyond simple text QA to rigorous structural evaluation, essential for future neuro-symbolic progress.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-intensive reasoning where models must derive an answer A from a query Q and a retrieved knowledge base S of a specific modality.

Inputs: Query Q (natural language or code) and Knowledge Base S (Text, KG triples, Code snippets, or Logic axioms)

Outputs: Answer A (free text, structured triples, code, or boolean)

Pipeline Flow

Input Query Q
Retrieval (Standardized Dense Retrieval of Knowledge S)
Prompt Construction (Integrate Q and S)
LLM Inference (Answer Generation)
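The four stages above can be sketched as one small function. This is a minimal illustration under stated assumptions, not the paper's code: `build_prompt`, `run_pipeline`, the prompt template, and the `retrieve` and `generate` callables are hypothetical names standing in for the fixed retriever and the LLM under test.

```python
from typing import Callable, List


def build_prompt(query: str, knowledge: List[str]) -> str:
    """Integrate the query Q and retrieved knowledge S (hypothetical template)."""
    context = "\n".join(f"- {item}" for item in knowledge)
    return f"Knowledge:\n{context}\n\nQuestion: {query}\nAnswer:"


def run_pipeline(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Query -> retrieval -> prompt construction -> LLM inference."""
    knowledge = retrieve(query, top_k)        # stage 2: fetch knowledge S
    prompt = build_prompt(query, knowledge)   # stage 3: combine Q and S
    return generate(prompt).strip()           # stage 4: produce answer A


# Toy stand-ins so the sketch runs without a real retriever or model.
def toy_retrieve(query: str, top_k: int) -> List[str]:
    kb = ["(Paris, capital_of, France)", "(Berlin, capital_of, Germany)"]
    return kb[:top_k]


def toy_generate(prompt: str) -> str:
    return " Paris " if "capital_of, France" in prompt else " unknown "


answer = run_pipeline("What is the capital of France?", toy_retrieve, toy_generate)
print(answer)
```

Passing the retriever and the model in as callables mirrors the benchmark's design choice of holding retrieval fixed while swapping the LLM.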

System Modules

Standardized Retriever

Retrieve relevant knowledge context S from the full KB based on similarity to Q

Model or implementation: Dense retrieval (specific model not detailed in main text, treated as fixed environment)
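As a rough sketch of what similarity-based retrieval involves, the snippet below ranks knowledge chunks by cosine similarity to a query vector. The paper does not name its embedding model, so the vectors here are hand-written toy values, and `cosine` and `retrieve_top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    kb_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in kb_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; a real system would obtain these from an embedding model.
kb = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
hits = retrieve_top_k([1.0, 0.1, 0.0], kb, k=2)
print([name for name, _ in hits])
```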

Target LLM

Generate the answer A by reasoning over Q and the provided structured context S

Model or implementation: Various (18 models evaluated)

Novel Architectural Elements

Evaluation framework specifically designed to isolate 'modality' as a variable: Text vs. KG vs. Code vs. Logic, with controlled retrieval.

Modeling

Base Model: Various (Llama3.1, Qwen2.5, DeepSeek, GPT-4, Claude3.7, etc.)

Training Method: Evaluation only (no training proposed)

Adaptation: None (Inference only)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. MMLU: OneEval provides external retrieval context rather than relying on parametric knowledge.
vs. HotpotQA: OneEval includes structured data (KG, Logic) beyond unstructured text.
vs. HumanEval: OneEval evaluates code reasoning as part of a broader multi-modal knowledge assessment.

Limitations

Retrieval module is fixed and may contain noise; does not evaluate the model's ability to improve retrieval itself.
Static datasets may not capture dynamic real-world knowledge updates.
Evaluation focuses on final-answer accuracy; intermediate reasoning steps are not scored directly and affect the score only through the final output.

Reproducibility

Datasets (OneEval and OneEval-Hard) and evaluation scripts are stated to be released publicly. Retrieval contexts are fixed to ensure fair comparison. Prompts for each task type appear to be provided in the appendix, although the main text only implies this.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive reasoning with provided external context (RAG-style)

Benchmarks:

OneEval: knowledge-intensive reasoning, full set [New]
OneEval-Hard: hard subset of OneEval with a high empirical failure rate [New]

Metrics:

Accuracy (Overall Score)
F1 score (for specific subsets like BioTextQA)
ISM@1 (for Code tasks)
EM (Exact Match, for Logic tasks)
Statistical methodology: Not explicitly reported in the paper
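For readers unfamiliar with the match-based metrics listed above, here is a minimal sketch of Exact Match and token-level F1. The normalization (lowercasing and whitespace tokenization) is an assumption; the paper's exact scoring scripts may normalize differently, and ISM@1 is not covered here.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (assumed normalization)."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("True", "true"))
print(token_f1("the red cat", "red cat"))
```

In the second call the prediction has three tokens, two of which match the gold answer, so precision is 2/3, recall is 1, and F1 is 0.8.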

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall performance on OneEval Full Set shows proprietary models leading, but all models struggle compared to pure text benchmarks.
OneEval (Full Set)	Overall Score (%)	45.8	58.1	+12.3
OneEval (Full Set)	Overall Score (%)	31.0	53.1	+22.1
Performance on OneEval-Hard reveals severe degradation across all models.
OneEval-Hard	Overall Score (%)	58.1	24.7	-33.4
OneEval-Hard	Overall Score (%)	21.0	32.2	+11.2
Impact of knowledge structuredness on performance.
OneEval (Average across models)	Accuracy (%)	53	25	-28

Experiment Figures

Line charts showing model performance trends across increasing levels of knowledge structuredness (Text → Code → KG → Logic).

Performance vs. Output Token Length (proxy for reasoning chain length).

Main Takeaways

Performance consistently declines as knowledge becomes more structured (Text > Code > KG > Logic), indicating models struggle with explicit structural constraints.
Reasoning-enhanced models (R-LLMs) scale better with complexity than standard models, maintaining a larger lead on the Hard subset.
Reasoning chain length shows diminishing returns: performance peaks at moderate lengths (800 to 1000 output tokens) and degrades thereafter, which is attributed to noise accumulation.
Textual reasoning correlates well with KG and Code performance on easy tasks, but decouples on hard tasks, suggesting distinct underlying reasoning substrates for symbolic manipulation.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph (KG) structure (triples, entities, relations)
Formal logic basics (axioms, ontologies)
LLM reasoning paradigms (Chain of Thought, Retrieval-Augmented Generation)

Key Terms

Knowledge Graph (KG): A structured representation of data using a network of entities and relationships, typically formatted as triples (subject, predicate, object).
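A small illustration of the triple format and of multi-hop lookup over it follows. The `Triple` type, the `match` helper, and the three example facts are chosen for this sketch and are not drawn from the benchmark's data.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


kg = [
    Triple("Marie Curie", "born_in", "Warsaw"),
    Triple("Warsaw", "capital_of", "Poland"),
    Triple("Marie Curie", "field", "Physics"),
]

# Two-hop question: of which country is Marie Curie's birthplace the capital?
birthplaces = [t.obj for t in match(kg, subject="Marie Curie", predicate="born_in")]
countries = [t.obj for city in birthplaces
             for t in match(kg, subject=city, predicate="capital_of")]
print(countries)
```

Answering the question requires chaining two triples, which is the kind of explicit structural step that plain-text reasoning does not force.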

R-LLM: Reasoning-LLM; models explicitly optimized for reasoning, often using techniques like Chain-of-Thought or reinforcement learning (e.g., o1, DeepSeek-R1).

OneEval-Hard: A subset of the benchmark containing samples where a majority of tested LLMs failed empirically, filtered further by human experts for reasoning complexity.
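The empirical half of this selection can be sketched as a failure-rate filter. The strict-majority threshold, the `results` layout, and the function name are assumptions made for illustration; the subsequent expert review is a manual step and is not modeled.

```python
from typing import Dict, List


def select_hard_candidates(
    results: Dict[str, List[bool]],
    min_fail_rate: float = 0.5,
) -> List[str]:
    """Keep sample ids that more than `min_fail_rate` of models answered wrongly.

    `results` maps a sample id to one correctness flag per evaluated model.
    """
    hard = []
    for sample_id, outcomes in results.items():
        if not outcomes:
            continue  # no model was run on this sample
        fail_rate = outcomes.count(False) / len(outcomes)
        if fail_rate > min_fail_rate:
            hard.append(sample_id)
    return hard


# Toy correctness flags for four samples and four models.
results = {
    "q1": [True, True, False, True],     # 25% failed
    "q2": [False, False, True, False],   # 75% failed
    "q3": [False, False, False, False],  # 100% failed
    "q4": [True, False, True, False],    # 50% failed, not a strict majority
}
candidates = select_hard_candidates(results)
print(candidates)
```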

F1 score: A metric balancing precision and recall, measuring the overlap between the predicted answer and the ground truth.

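ISM@1: A code-evaluation metric computed on the top-ranked generated solution. ISM most likely stands for Identifier Sequence Match, which checks whether the expected identifiers appear in order in the generated code; it is a match-based proxy for correctness and does not execute the code.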

Dense retrieval: A method using vector embeddings to find relevant knowledge chunks based on semantic similarity rather than keyword matching.

Logic Base: A formal specification of a domain using concepts, properties, and axioms (rules), requiring deductive reasoning.
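To make deductive reasoning over axioms concrete, here is a minimal forward-chaining sketch over ground (variable-free) rules. It is a deliberately simplified stand-in: the ontologies in the benchmark are richer, and the facts, rules, and the `forward_chain` name are invented for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (set of premises, conclusion); all atoms are plain strings.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


rules: List[Rule] = [
    (frozenset({"Penguin(tux)"}), "Bird(tux)"),
    (frozenset({"Bird(tux)"}), "Animal(tux)"),
    (frozenset({"Animal(tux)", "Swims(tux)"}), "AquaticAnimal(tux)"),
]
facts = {"Penguin(tux)", "Swims(tux)"}

closure = forward_chain(facts, rules)
entailed = "AquaticAnimal(tux)" in closure
print(entailed)
```

The target fact only follows after three rule applications, so a model answering a boolean query over such a base has to track every intermediate conclusion.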