MEM1: Memory-Efficient Mechanism via learning 1-step integrated reasoning and consolidation—the proposed method
Internal State (<IS>): A generated text block that acts as the agent's working memory, summarizing past information and reasoning about the next steps
Masked Trajectory: A training technique that reconstructs a full coherent trajectory from fragmented memory steps to allow standard RL policy optimization
Reinforce++: A reinforcement learning algorithm used to optimize the agent's policy
Token-wise Advantage: An RL estimate of how much better a specific token choice is than the expected outcome (baseline) at that position
Multi-objective QA: A synthetic task type created by the authors where agents must answer multiple distinct sub-questions (objectives) in a single episode
Context Pruning: Removing tokens from the input prompt (history) to keep the context length manageable
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm (referenced as a comparison for trajectory handling)
KL penalty: Kullback-Leibler divergence penalty—used to prevent the RL policy from drifting too far from the reference model
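
The Internal State and Context Pruning entries above can be illustrated with a small toy loop. This is a minimal sketch, not the authors' implementation: `build_prompt`, `toy_policy`, `toy_env`, `run_episode`, the `<obs>` tag, and the `MAX_STATE_CHARS` cap are all hypothetical names introduced for illustration, and simple string truncation stands in for the consolidation that a trained model would learn. The point it shows is that each prompt carries only the latest `<IS>` block and the latest observation, so prompt length stays bounded as turns accumulate.

```python
import re
from typing import List, Tuple

# Hypothetical cap standing in for a learned, compact internal state.
MAX_STATE_CHARS = 80


def build_prompt(question: str, internal_state: str, observation: str) -> str:
    # Pruned context: the question, the latest <IS> block, and the latest
    # observation only. Earlier turns are not included.
    return f"{question}\n<IS>{internal_state}</IS>\n<obs>{observation}</obs>"


def toy_policy(prompt: str) -> Tuple[str, str]:
    # Stand-in for the language model (assumption): read the previous state
    # and observation back out of the prompt, merge them, and keep only the
    # most recent characters as the new internal state.
    match = re.search(r"<IS>(.*?)</IS>\n<obs>(.*?)</obs>", prompt, re.S)
    if match is None:
        raise ValueError("prompt does not contain <IS> and <obs> blocks")
    old_state, observation = match.group(1), match.group(2)
    merged = f"{old_state} | {observation}".strip(" |")
    return merged[-MAX_STATE_CHARS:], "search"


def toy_env(action: str, turn: int) -> str:
    # Stand-in for a tool or environment response (assumption).
    return f"result {turn} for action '{action}'"


def run_episode(question: str, max_turns: int = 6) -> Tuple[str, List[int]]:
    internal_state, observation = "", ""
    prompt_lengths: List[int] = []
    for turn in range(max_turns):
        prompt = build_prompt(question, internal_state, observation)
        prompt_lengths.append(len(prompt))
        # The new internal state replaces the old one instead of being appended.
        internal_state, action = toy_policy(prompt)
        observation = toy_env(action, turn)
    return internal_state, prompt_lengths
```

Calling `run_episode("Q?", max_turns=8)` returns the final internal state and the per-turn prompt lengths. In this toy setup the lengths level off once the state reaches its cap, whereas a full-history prompt would keep growing with every turn.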