NanoKnow: The proposed benchmark dataset, which partitions questions according to whether their answers appear in the FineWeb-Edu pre-training corpus
nanochat: A family of small LLMs (d20, d32, d34) pre-trained entirely on the open FineWeb-Edu corpus, enabling full data transparency
FineWeb-Edu: A 100-billion-token open corpus of educational web content used to pre-train nanochat
parametric knowledge: Knowledge acquired during pre-training and stored in the model's weights (parameters)
Supported split: Questions for which the answer string appears in the pre-training corpus in a relevant context
Unsupported split: Questions for which the answer string does not appear in the pre-training corpus
LLM-Judge: An evaluation method in which a strong LLM (here Qwen3-14B) judges the correctness of a model's response, used instead of exact string matching
BM25: A probabilistic information retrieval ranking function that scores documents against a query using term frequency, inverse document frequency, and document-length normalization
RAG: Retrieval-Augmented Generation, in which external documents are provided to an LLM to help it answer questions
distractors: Irrelevant documents provided to the LLM alongside the correct context to test its robustness
shards: Sub-files of a large dataset; FineWeb-Edu is divided into 1,823 Parquet shards
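
To make the BM25 entry concrete, the following is a minimal sketch of the standard Okapi BM25 scoring formula, not the retrieval setup used in this work. The function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the smoothed (always non-negative) IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Hypothetical helper: lowercased whitespace tokenization and the
    k1/b defaults are illustrative assumptions, not values from the source.
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(d) for d in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    query_terms = query.lower().split()
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant): rarer terms weigh more.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration).
docs = [
    "the capital of france is paris",
    "bananas are rich in potassium",
    "paris hosted the olympic games",
]
scores = bm25_scores("paris capital", docs)
# Document indices ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy corpus the first document matches both query terms, the third matches only one, and the second matches none, so the ranking reflects that order. In a RAG setting, the top-ranked documents would be the ones passed to the LLM as context.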