
LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
University of Trento
arXiv (2025)
Factuality Benchmark RAG KG

📝 Paper Summary

Factual Knowledge Evaluation Knowledge Internalization
The paper introduces a dynamic benchmark (DyKnow) to assess LLM temporal factuality and proposes Entity-Aware Fine-tuning (ENAF) to improve knowledge consistency by embedding structured entity identifiers.
Core Problem
LLMs trained on static snapshots often generate outdated or inconsistent responses to time-sensitive factual queries because their internal knowledge is fragmented across different lexicalizations and timestamps.
Why it matters:
  • Static benchmarks (e.g., LAMA) quickly become outdated and are prone to data contamination, failing to measure a model's ability to handle real-time changes.
  • Inconsistent knowledge representation leads to contradictory answers when prompts are slightly perturbed (e.g., 'CR7' vs 'Cristiano Ronaldo'), undermining user trust.
  • Existing knowledge editing methods (ROME, MEMIT) often fail to generalize across different entity lexicalizations or scale to real-world scenarios.
Concrete Example: When asked which club Cristiano Ronaldo plays for, an LLM might answer 'Juventus' or 'Manchester United' (both outdated) depending on whether the prompt refers to him as 'Cristiano Ronaldo' or 'CR7', and in neither case retrieve the current answer (Al-Nassr).
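The distinction between a current, an outdated, and an unrelated answer in the example above can be illustrated with a small check. This is a minimal sketch, not the paper's evaluation code: `FactTimeline`, `classify_answer`, the label names, and the substring matching rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactTimeline:
    """Hypothetical container: the value that holds now plus earlier values."""
    current: str
    previous: List[str]


def classify_answer(answer: str, timeline: FactTimeline) -> str:
    """Label a model answer as 'correct', 'outdated', or 'irrelevant'.

    Assumption: a case-insensitive substring match is enough to detect
    which value the answer mentions. A real evaluation would need
    something more robust.
    """
    text = answer.casefold()
    if timeline.current.casefold() in text:
        return "correct"
    if any(value.casefold() in text for value in timeline.previous):
        return "outdated"
    return "irrelevant"


# Values taken from the example in the summary.
ronaldo_club = FactTimeline(
    current="Al-Nassr",
    previous=["Juventus", "Manchester United"],
)

labels = [
    classify_answer("He plays for Juventus.", ronaldo_club),
    classify_answer("Cristiano Ronaldo plays for Al-Nassr.", ronaldo_club),
    classify_answer("He is a footballer from Portugal.", ronaldo_club),
]
```

Keeping 'outdated' separate from 'irrelevant' is the point of the sketch: an answer that was once true tells you the model's knowledge is stale, while an unrelated answer tells you something else went wrong.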
Key Novelty
Dynamic Benchmarking (DyKnow) + Entity-Aware Fine-tuning (ENAF)
  • DyKnow: A benchmark framework that generates questions from live Wikidata snapshots at evaluation time, ensuring ground truth is always current and distinguishing between 'outdated' and 'incorrect' answers.
  • ENAF: A soft neurosymbolic fine-tuning approach that maps different text variations of an entity (e.g., 'CR7', 'Ronaldo') to a single structured identifier (entity tag or ID) within the model's training data.
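The entity-mapping idea behind ENAF can be pictured as a preprocessing pass over training text. The sketch below is one possible reading and not the authors' pipeline: the alias table, the `ENT_RONALDO` identifier, the bracketed tag format, and the choice to replace (rather than annotate) each mention are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical alias table: one structured identifier per entity,
# with the surface forms that should all resolve to it.
ALIASES: Dict[str, List[str]] = {
    "ENT_RONALDO": ["Cristiano Ronaldo", "CR7", "Ronaldo"],
}


def tag_entities(text: str, aliases: Dict[str, List[str]]) -> str:
    """Replace every known surface form with its entity tag.

    Longer aliases are tried first so that 'Cristiano Ronaldo' is matched
    as a whole instead of leaving 'Cristiano' behind.
    """
    pairs = [(alias, eid) for eid, names in aliases.items() for alias in names]
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    lookup = {alias.casefold(): eid for alias, eid in pairs}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(alias) for alias, _ in pairs) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "[" + lookup[m.group(1).casefold()] + "]", text)


tagged_a = tag_entities("Which club does CR7 play for?", ALIASES)
tagged_b = tag_entities("Which club does Cristiano Ronaldo play for?", ALIASES)
```

After this pass both prompts become the same string, which is the property the bullet describes: different lexicalizations of one entity collapse to a single identifier before the model sees them.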
Evaluation Highlights
  • State-of-the-art models like Llama-3 and GPT-4 still produce outdated or irrelevant answers for >20% of time-sensitive queries.
  • ENAF yields markedly higher prompt consistency (agreement across perturbations) than standard fine-tuning (e.g., a 15-20% consistency gain on Llama-2-7B).
  • Retrieval-Augmented Generation (RAG) generally outperforms parametric knowledge editing methods (ROME, MEMIT) in accuracy for time-sensitive facts.
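Prompt consistency, described above as agreement across perturbations, can be made concrete with a simple score. The summary does not give the paper's exact definition, so the sketch assumes one plausible version: the share of facts for which every perturbed prompt yields the same normalized answer. The function name and the sample answers are illustrative only.

```python
from typing import Dict, List


def prompt_consistency(answers_by_fact: Dict[str, List[str]]) -> float:
    """Fraction of facts whose perturbed prompts all give the same answer.

    Assumption: answers are compared after stripping whitespace and
    case-folding. Returns 0.0 for an empty input.
    """
    if not answers_by_fact:
        return 0.0
    consistent = 0
    for answers in answers_by_fact.values():
        normalized = {answer.strip().casefold() for answer in answers}
        if len(normalized) == 1:
            consistent += 1
    return consistent / len(answers_by_fact)


# Illustrative answers, one list per fact, one entry per prompt variant.
sample = {
    "ronaldo_club": ["Juventus", "Manchester United"],  # variants disagree
    "placeholder_fact": ["Value A", "value a "],        # variants agree
}
score = prompt_consistency(sample)
```

A score like this measures agreement only. A model can be perfectly consistent and still outdated, which is why it is reported alongside accuracy rather than instead of it.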
Breakthrough Assessment
7/10
Strong contribution in defining dynamic benchmarking for temporal facts. The proposed neurosymbolic fine-tuning (ENAF) is a logical step for consistency, though the paper confirms RAG remains superior for pure accuracy on dynamic data.