System 2 scaling: Spending additional test-time computation (e.g., on iterative self-correction or search) to gain accuracy, mimicking deliberate human reasoning
Amortization: Spreading the high one-time cost of a resource (here, the compute spent generating a critique) over many future uses, which lowers the average cost per use
Rubric: A scoring guide used to evaluate performance, consisting of multi-dimensional criteria and behavioral descriptors
Zero-shot: Attempting a task without providing specific examples in the prompt; here, generating a response using only the retrieved memory guidelines
Episodic nature: The limitation in which an AI model treats every interaction as a blank slate, retaining nothing from previous sessions once they end
Tool calling: The capability of an LLM to invoke external functions (like 'write_file') to perform actions outside its text generation loop
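
To make the tool-calling entry concrete, here is a minimal sketch of how a host program might execute a call that a model emits. The JSON shape (`name` plus `arguments`), the `TOOLS` registry, the `dispatch` helper, and the signature of `write_file` are all assumptions made for illustration; they do not reproduce any particular vendor's API or the system described here.

```python
import json
import os
import tempfile


def write_file(path: str, content: str) -> str:
    """Hypothetical tool: write text to a file and report what was written."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} characters to {path}"


# Registry mapping tool names (as the model would emit them) to callables.
TOOLS = {"write_file": write_file}


def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Arguments are passed by keyword, so they must match the tool's signature.
    return TOOLS[name](**call["arguments"])


# Simulated model output; a real model would generate this JSON itself.
path = os.path.join(tempfile.mkdtemp(), "memory.txt")
guideline = "Check units before answering."
model_output = json.dumps(
    {"name": "write_file", "arguments": {"path": path, "content": guideline}}
)
result = dispatch(model_output)
```

The model only produces text (the JSON request); the host program performs the side effect and would normally return `result` to the model as the next message. Keeping an explicit registry means an unrecognized tool name fails loudly instead of executing arbitrary code.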