
ScienceMeter: Tracking Scientific Knowledge Updates in Language Models

Y Wang, S Feng, Y Tsvetkov, H Hajishirzi
Not reported in the paper
arXiv, 5/2025
Factuality Benchmark Pretraining

📝 Paper Summary

Factuality Knowledge Update
ScienceMeter evaluates how well LLMs update scientific knowledge by measuring preservation of old facts, acquisition of new ones, and projection of future discoveries across ten scientific domains.
Core Problem
Scientific knowledge evolves rapidly, but pre-trained LLMs are static. Existing update methods such as RAG or fine-tuning have not been comprehensively evaluated on whether they preserve old knowledge while also enabling reasoning about future discoveries.
Why it matters:
  • Current methods often focus on just adding new information, neglecting the retention of foundational past knowledge
  • Scientific research requires models to not just memorize facts but generalize to anticipate future or yet-undiscovered claims
  • Without robust updates, LLMs quickly become stale and unreliable for assisting in rapidly advancing fields like medicine and computer science
Concrete Example: When a model is updated with a new paper on materials science, it might correctly learn the new claim (acquisition) but hallucinate that a previously known valid claim is now false (distortion in preservation), or fail to infer a logical next step in the research (failure in projection).
Key Novelty
ScienceMeter Evaluation Framework
  • Defines knowledge updates along three temporal axes: Preservation (past), Acquisition (present/new), and Projection (future), rather than just accuracy on new data
  • Operationalizes scientific knowledge as 'atomic scientific claims' (support/refute) rather than simple factoids, allowing for more rigorous verification
  • Introduces a 'distortion' metric to penalize confident but incorrect answers more heavily than simple lack of knowledge (unknowns)
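The distinction between a wrong answer and an abstention can be illustrated with a minimal sketch. It assumes each atomic claim carries a gold label of 'support' or 'refute' and that the model either predicts one of those labels or abstains (represented here as `None`). The function names `classify_outcome` and `summarize_outcomes` and the three outcome names are hypothetical, chosen for illustration only; the summary above does not give the paper's exact metric definitions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical outcome names; the paper's own terminology may differ.
OUTCOMES = ("correct", "unknown", "distorted")


def classify_outcome(gold: str, predicted: Optional[str]) -> str:
    """Classify one claim judgment.

    gold: 'support' or 'refute'.
    predicted: 'support', 'refute', or None when the model abstains.
    """
    if predicted is None:
        return "unknown"      # lack of knowledge, not an error
    if predicted == gold:
        return "correct"
    return "distorted"        # a committed answer that contradicts the gold label


def summarize_outcomes(pairs: Iterable[Tuple[str, Optional[str]]]) -> dict:
    """Return the fraction of claims falling into each outcome."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (gold, predicted) pair")
    counts = Counter(classify_outcome(gold, pred) for gold, pred in pairs)
    return {name: counts[name] / len(pairs) for name in OUTCOMES}
```

Reporting the distorted fraction separately from the unknown fraction keeps a model that abstains from being scored the same as one that asserts the opposite of a valid claim. The same tally could be run on past, new, and future claim sets to compare preservation, acquisition, and projection.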
Architecture
Figure 2: The ScienceMeter evaluation framework workflow.
Evaluation Highlights
  • The best update methods preserve only 85.9% of existing knowledge and acquire 71.7% of new knowledge on average
  • Future knowledge projection is poor, with the best methods achieving only 37.7% success
  • Inference-time updates (RAG-style) work well for larger models (OLMo2-32B), while smaller models (LLaMA-8B) require training to achieve comparable results
Breakthrough Assessment
8/10
Establishes a rigorous, necessary framework for a critical problem (scientific knowledge updates). The separation of preservation, acquisition, and projection provides deep insight into model limitations.