NanoKnow: How to Know What Your Language Model Knows

📝 Paper Summary

Benchmark datasets | Knowledge probing in LLMs | Retrieval-Augmented Generation (RAG) analysis

NanoKnow creates a benchmark by projecting standard QA datasets onto the fully open FineWeb-Edu corpus, allowing researchers to precisely measure how pre-training data presence influences an LLM's knowledge and RAG performance.

Core Problem

It is hard to tell whether an LLM answers a question by recalling something it memorized during pre-training or by reasoning over the provided context. The main obstacle is that pre-training data is usually a 'black box'.

Why it matters:

Understanding knowledge origins is crucial for distinguishing between memorization and reasoning capabilities
Researchers cannot accurately evaluate RAG systems without knowing if the model already knows the answer from its training data
Lack of transparency prevents studying how data frequency affects recall or how parametric knowledge interacts with external evidence

Concrete Example: If an LLM answers 'Who won the 1996 World Cup?', we cannot tell whether it took the answer from an external document or had already seen the fact 50 times in its private training corpus. NanoKnow addresses this by using a model (nanochat) whose training data (FineWeb-Edu) is fully known.

Key Novelty

Transparent Data-Knowledge Mapping

Partitions existing QA datasets (NQ, SQuAD) into 'supported' (answer exists in pre-training data) and 'unsupported' splits by indexing the exact corpus used for pre-training
Leverages the fully open nanochat model family and FineWeb-Edu corpus to create a controlled environment where every piece of parametric knowledge can be traced back to specific training documents

Architecture

The 3-stage pipeline for constructing the NanoKnow benchmark dataset.

Evaluation Highlights

Closed-book accuracy on Natural Questions roughly doubles for questions with high answer frequency (51+ occurrences) compared to rare answers (1-5 occurrences)
Providing external evidence (RAG) improves accuracy, but models still perform better on 'supported' questions (+19-25% improvement) than 'unsupported' ones even when given the correct context
Distractor documents harm performance significantly: accuracy drops by ~11 points when the correct answer is surrounded by 4 distractors compared to 1
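
The frequency comparison above amounts to grouping questions by how often their answer occurs in the corpus and computing accuracy per group. The following is a minimal sketch under stated assumptions: only the "1-5" and "51+" buckets are named in this summary, so the middle "6-50" bucket, the function names, and the input format are hypothetical.

```python
from collections import defaultdict


def frequency_bucket(count: int) -> str:
    """Map an answer's corpus frequency to a bucket label.

    Only "1-5" and "51+" come from the summary; "0" and "6-50" are
    placeholder assumptions for illustration.
    """
    if count <= 0:
        return "0"
    if count <= 5:
        return "1-5"
    if count <= 50:
        return "6-50"
    return "51+"


def accuracy_by_bucket(results):
    """Compute accuracy per bucket.

    results: iterable of (answer_frequency, is_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for count, correct in results:
        entry = totals[frequency_bucket(count)]
        entry[0] += int(correct)
        entry[1] += 1
    return {name: c / t for name, (c, t) in totals.items()}
```

Calling `accuracy_by_bucket` on per-question results from a closed-book run would give the kind of per-bucket numbers the first highlight compares.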

Breakthrough Assessment

9/10

Highly significant resource. By aligning open models with open data, it enables precise scientific inquiry into the 'black box' of LLM knowledge that was previously impossible with closed-source models.

⚙️ Technical Details

Problem Definition

Setting: Benchmarking LLM Question Answering capabilities conditioned on the presence and frequency of answers in the pre-training corpus

Inputs: Natural language questions from NQ and SQuAD

Outputs: Relevance judgments mapping questions to specific shards/offsets in FineWeb-Edu, classifying them as Supported or Unsupported
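
One way to picture this output is as one record per question, carrying the corpus locations where a verified answer occurrence was found. The sketch below is an assumption about the shape of such a record, not the released schema: the class names and the `shard` and `offset` fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RelevanceJudgment:
    """One verified answer occurrence in the corpus (hypothetical schema)."""
    shard: int   # index of the FineWeb-Edu shard holding the document
    offset: int  # position of the document within that shard


@dataclass
class QuestionRecord:
    """A QA question together with its corpus-level relevance judgments."""
    question: str
    answers: list[str]
    judgments: list[RelevanceJudgment] = field(default_factory=list)

    @property
    def split(self) -> str:
        # "supported" means at least one verified occurrence was found.
        return "supported" if self.judgments else "unsupported"
```

Deriving the split from the judgments, rather than storing it separately, keeps the label consistent with the evidence behind it.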

Pipeline Flow

Candidate Retrieval (BM25 search over FineWeb-Edu)
String Matching (Check if answer string exists in candidates)
LLM Verification (Filter coincidental matches)
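
The three stages above can be sketched end to end on a toy corpus. This is an illustration under stated assumptions, not the authors' code: a small in-memory BM25 stands in for the Anserini index, the `k1` and `b` values are illustrative, and `verify` is a hypothetical callable standing in for the LLM verifier.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, docs, k=3, k1=0.9, b=0.4):
    """Stage 1: rank documents with a small in-memory BM25."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in toks) / max(n, 1)
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in toks for term in set(t))
    scored = []
    for i, t in enumerate(toks):
        tf = Counter(t)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    # Keep only documents that share at least one term with the query.
    return [i for score, i in ranked[:k] if score > 0]


def contains_answer(doc: str, answer: str) -> bool:
    """Stage 2: case-insensitive substring check for the answer string."""
    return answer.lower() in doc.lower()


def find_support(question, answer, docs, verify, k=3):
    """Run all three stages; return indices of verified supporting docs."""
    candidates = bm25_top_k(question, docs, k)
    matched = [i for i in candidates if contains_answer(docs[i], answer)]
    # Stage 3: `verify` is a placeholder for the LLM relevance check.
    return [i for i in matched if verify(question, answer, docs[i])]
```

With a verifier that accepts everything, a document about the song 'Paris' would pass the string match for a question about the capital of France; the third stage exists to reject exactly that kind of coincidental hit.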

System Modules

Anserini Indexer (Data Construction)

Build a searchable BM25 index over the 171GB FineWeb-Edu corpus

Model or implementation: Anserini (Lucene-based)

String Matcher (Data Construction)

Identify potential matches by checking if the answer string appears in retrieved documents

Model or implementation: Rule-based script

LLM Verifier (Data Construction)

Filter out coincidental string matches (e.g., 'Paris' the song vs. 'Paris' the city)

Model or implementation: Qwen3-8B

Modeling

Base Model: nanochat (d20, d32, d34 variants)

Comparison to Prior Work

vs. Carlini et al.: NanoKnow uses a fully open training set to map knowledge explicitly rather than inferring it via attacks
vs. Standard QA Benchmarks (NQ/SQuAD): NanoKnow adds a layer of metadata (supported/unsupported splits) specific to a model's training data
vs. General RAG studies: NanoKnow makes it possible to separate the confounding question 'did the model already know this?' from the RAG performance analysis

Limitations

Relies on a specific corpus (FineWeb-Edu) and model family (nanochat), limiting direct transfer of splits to other models like Llama or GPT
Verification uses an LLM (Qwen3-8B), which may have its own biases or errors in judging relevance
Analysis focuses on exact string presence; it does not account for knowledge acquired via paraphrases or reasoning across documents

Reproducibility

Code: https://github.com/castorini/NanoKnow

📊 Experiments & Results

Evaluation Setup

Closed-book and Open-book (RAG) Question Answering

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
SQuAD (Reading Comprehension)

Metrics:

Exact Match (EM)
LLM-Judge Accuracy (using Qwen3-14B)
Statistical methodology: Not explicitly reported in the paper
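
For readers unfamiliar with Exact Match, the sketch below shows the normalization commonly used for SQuAD-style evaluation (lowercasing, stripping punctuation and articles, collapsing whitespace). This summary does not state which normalization the paper applies, so treat the details as an assumption.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)
```

Exact Match rejects any prediction with extra words, such as 'Paris, France' for the gold answer 'Paris', which is one reason to report an LLM-judge metric alongside it.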

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Frequency analysis shows a strong correlation between how often an answer appears in pre-training data and the model's ability to answer it, both in closed-book and open-book settings.
Natural Questions	Accuracy (approx. from Fig 4)	0.15	0.35	+0.20
Model scaling results demonstrate that larger models memorize more, improving closed-book performance significantly.
SQuAD	LLM-Judge Accuracy	0.113	0.330	+0.217
Natural Questions	LLM-Judge Accuracy	0.140	0.409	+0.269
RAG analysis reveals that parametric knowledge complements external evidence; models perform better when they have 'seen' the answer before, even when given the answer context.
SQuAD	LLM-Judge Accuracy	0.627	0.785	+0.158
Natural Questions	LLM-Judge Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper
Distractor analysis confirms that irrelevant information harms performance, with the number of distractors linearly degrading accuracy.
SQuAD	LLM-Judge Accuracy	0.478	0.367	-0.111
SQuAD	LLM-Judge Accuracy	0.269	0.254	-0.015

Experiment Figures

Line plot showing QA accuracy vs. Answer Frequency in Pre-training Data for Closed-Book and Open-Book settings.

Main Takeaways

Closed-book accuracy is strongly influenced by answer frequency in pre-training data; models essentially 'memorize' frequent facts.
RAG helps mitigate low memorization (rare answers), but models are still significantly more accurate when the answer was also seen during pre-training (parametric + external synergy).
Smaller models (d20) show less benefit from answer frequency, suggesting a capacity threshold for effective memorization.
Non-relevant information (distractors) actively harms performance, with accuracy decreasing as the number of distractors increases, even if the model 'knows' the answer parametrically.
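
The distractor setting can be illustrated by building a context that mixes the document containing the answer with a chosen number of irrelevant ones. This is a hypothetical sketch: the function names, the prompt layout, and the decision to shuffle the gold document's position are assumptions, not details taken from the paper.

```python
import random


def build_context(gold_doc, distractor_pool, num_distractors, seed=0):
    """Return the gold document mixed with sampled distractors."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    docs = rng.sample(distractor_pool, num_distractors) + [gold_doc]
    rng.shuffle(docs)  # assumption: the gold position is randomized
    return docs


def format_prompt(question, docs):
    """Lay out the documents and question as a single prompt string."""
    blocks = [f"Document {i + 1}: {d}" for i, d in enumerate(docs)]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}\nAnswer:"
```

Varying `num_distractors` while holding the gold document fixed isolates the effect of irrelevant context from the effect of missing evidence.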

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model pre-training
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of search indexing (BM25)

Key Terms

NanoKnow: The proposed benchmark dataset partitioning questions based on whether their answers exist in the FineWeb-Edu pre-training corpus

nanochat: A family of small LLMs (d20, d32, d34) pre-trained entirely on the open FineWeb-Edu corpus, enabling full data transparency

FineWeb-Edu: A 100-billion-token open corpus of educational web content used to pre-train nanochat

parametric knowledge: Knowledge stored within the model's weights (parameters) acquired during pre-training

Supported split: Questions for which the answer string appears in the pre-training corpus in a relevant context

Unsupported split: Questions for which no occurrence of the answer string in a relevant context was found in the pre-training corpus

LLM-Judge: Using a strong LLM (here Qwen3-14B) to evaluate the correctness of a model's response instead of exact string matching

BM25: A probabilistic information retrieval algorithm used to rank documents based on term frequency and inverse document frequency

RAG: Retrieval-Augmented Generation, which provides external documents to an LLM to help it answer questions

distractors: Irrelevant documents provided to the LLM alongside the correct context to test its robustness

shards: Sub-files of a large dataset; FineWeb-Edu is divided into 1,823 parquet shards
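
As background for the BM25 entry above, the standard Okapi BM25 score of a document D for a query Q can be written as follows. Here f(t, D) is the frequency of term t in D, |D| is the document length, avgdl is the average document length, N is the number of documents, n(t) is the number of documents containing t, and k_1 and b are tunable parameters. The smoothed IDF shown is one common variant; this is general reference material, not a detail reported in the paper.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```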