NIAH: Needle In A Haystack, a standard benchmark that tests whether a model can retrieve a specific piece of information buried in a long context.
Atomic Tests: Tests designed to evaluate individual memory capabilities in isolation, such as searching, recalling, or matching.
Composite Tests: Tests that combine multiple atomic capabilities to simulate complex, real-world scenarios involving boundaries and interactions between memory segments.
Stateful Processing: Tasks requiring the model to track the changing state of an entity (e.g., a number or a set) through a sequence of operations defined in the context.
Theory of Mind: In this specific benchmark, a composite test requiring the model to track knowledge states of different entities and information flow between them.
ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence), a metric that scores the overlap between a generated summary and a reference summary by the length of their longest common subsequence of tokens.
Jaccard Similarity: A statistic for comparing two sets, defined as the size of their intersection divided by the size of their union; it ranges from 0 (no shared elements) to 1 (identical sets).
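
The two scoring metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names `jaccard`, `lcs_length`, and `rouge_l` are hypothetical, tokenization is assumed to be lowercase whitespace splitting, and ROUGE-L is reported as a balanced F1 of LCS precision and recall, which may differ from the variant used in any particular evaluation.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(y) + 1)  # DP row for the previous element of x
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L as an F1 score over LCS precision and recall.

    Assumption: lowercase whitespace tokenization and equal weighting
    of precision and recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 total -> 0.5
print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))
```

In the second call the longest common subsequence is "the cat on the mat" (5 of 6 tokens on each side), so precision and recall are both 5/6. Jaccard ignores order and duplicates, which suits set-valued answers, while ROUGE-L rewards tokens that appear in the same order as in the reference.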