Level-1 Factual Memory: Memory tasks where information is explicitly stated and can be directly recalled (e.g., 'What is my dog's name?')
Level-2 Cognitive Memory: Memory tasks requiring the retention and application of implicit constraints like user goals, values, or states (e.g., behaving supportively because the user was sad earlier)
cue–trigger semantic disconnect: A scenario where the immediate user query (trigger) shares no keywords and little semantic similarity with the relevant past information (cue), so simple retrieval shortcuts cannot surface it
task disclosure: The common practice of explicitly telling the model 'This is a memory task' in the prompt, which the authors argue biases evaluation
constraint consistency: An evaluation metric that checks whether a response adheres to a behavioral rule derived from the conversation history, rather than whether it matches a specific text string
BM25: A ranking function used in information retrieval that estimates the relevance of a document to a search query from keyword matches, weighted by term frequency, term rarity across the collection, and document length
MPNet: A sentence embedding model used here to measure semantic similarity and filter out easy cases where the cue and trigger are too similar
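
To make the BM25 and cue–trigger entries concrete, here is a minimal sketch of the lexical side of such a check, written in plain Python. It uses the standard Okapi BM25 formula with commonly used defaults (k1 = 1.5, b = 0.75). The tokenizer, the function name `bm25_scores`, and the example sentences are illustrative assumptions, not the authors' implementation. The semantic side (MPNet similarity) is not covered here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    k1 and b are common default values, not values taken from the source.
    """
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical conversation history (cues).
history = [
    "I felt really sad after my dog Max got sick last week.",
    "My dog's name is Max.",
]

# Level-1 style trigger: shares keywords with the history.
factual = bm25_scores("What is my dog's name?", history)

# Level-2 style trigger: no shared tokens with either cue,
# so every BM25 score is 0.0 and keyword retrieval finds nothing.
disconnected = bm25_scores("Any ideas for this weekend?", history)
```

In this sketch the factual trigger ranks the second history entry highest, while the disconnected trigger scores zero against both entries. That zero-overlap case is what a cue–trigger semantic disconnect looks like to a keyword retriever.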