MEM: Multi-Scale Embodied Memory for Vision Language Action Models

📝 Paper Summary

Memory organization, Agentic AI

MEM equips robot policies with two distinct memory scales: a dense video encoder for immediate physical context and a compressed text-based semantic memory for tracking long-term task progress.

Core Problem

Robots need both immediate visual history (to handle occlusions/dynamics) and long-term semantic history (to track recipe steps), but encoding full video history for long tasks is computationally intractable.

Why it matters:

Encoding minutes of video frames individually explodes inference latency, making real-time robot control impossible
Single-modality memories fail: text lacks spatial precision for grasping, while keyframes lose dynamic information needed for physics estimation
Without long-term memory, robots repeat completed subtasks (e.g., adding an ingredient twice) or fail to recover from temporary visual occlusions

Concrete Example: A robot cooking dinner might need to remember it already added salt 10 minutes ago (long-term semantic fact) while simultaneously needing to remember where a bowl is located now that its arm is blocking the camera (short-term visual occlusion). Current models struggle to handle both timescales efficiently.

Key Novelty

Multi-Scale Embodied Memory (MEM)

Uses a specialized video encoder with factorized spatial-temporal attention to compress recent video history into a fixed number of tokens, enabling dense visual memory without high computational cost
Maintains a running text summary of past events (semantic memory) managed by a high-level policy, which explicitly predicts memory updates rather than storing raw history

Architecture

Figure: The MEM architecture, showing how the Video Encoder interacts with the VLA Backbone.

Evaluation Highlights

Enables robots to solve tasks spanning up to 15 minutes, such as cleaning a whole kitchen or preparing a grilled cheese sandwich
Achieves state-of-the-art performance on complex manipulation tasks by integrating with the π0.6 VLA model
Demonstrates in-context adaptation capabilities, allowing the policy to correct mistakes and handle partial observability using short-term video memory

Breakthrough Assessment

8/10

Strong engineering solution to the context length problem in robotics. Effectively combines the spatial precision of video with the compression of text, enabling significantly longer task horizons than typical frame-stacking approaches.

⚙️ Technical Details

Problem Definition

Setting: Long-horizon robotic manipulation tasks requiring memory of past events and current state estimation

Inputs: Task goal g (text), sequence of dense observations o_{t-T:t} (images, proprioception)

Outputs: Continuous robot actions a_{t:t+H}
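
The interface above can be written down as typed containers. This is only a sketch: the class and function names (`Observation`, `PolicyInput`, `observation_window`) are hypothetical, and images are stood in for by string IDs rather than real arrays.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    """One timestep o_t: camera images plus proprioceptive state."""
    images: List[str]            # placeholder IDs standing in for image arrays
    proprioception: List[float]  # e.g. joint angles, gripper position


@dataclass
class PolicyInput:
    goal: str                       # task goal g, given as text
    history: Sequence[Observation]  # the window o_{t-T:t}


def observation_window(
    episode: Sequence[Observation], t: int, T: int
) -> Sequence[Observation]:
    """Return o_{t-T:t} (inclusive), clipped at the start of the episode."""
    return episode[max(0, t - T): t + 1]
```

Early in an episode the window is simply shorter than T + 1 observations; how the actual system pads or masks that case is not stated in this summary.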

Pipeline Flow

Input Processing: Video Encoder + Proprioception Embedding
High-Level Policy: Generates Subtask + Updates Memory
Low-Level Policy: Generates Robot Actions
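
The three stages can be sketched as a control loop. Everything here is illustrative: `run_episode`, the toy policies, and the `done:` observation strings are hypothetical stand-ins for learned models and camera input, and calling the high-level policy at every step is an assumption made for brevity.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

HighLevel = Callable[[str, str, List[str]], Tuple[str, str]]
LowLevel = Callable[[str, List[str]], str]


def run_episode(
    goal: str,
    observations: Sequence[str],
    high_level: HighLevel,
    low_level: LowLevel,
    video_horizon: int = 6,  # hypothetical default window size
) -> Tuple[List[str], str]:
    memory = ""                           # long-term semantic (text) memory
    recent = deque(maxlen=video_horizon)  # short-term visual window
    actions = []
    for obs in observations:
        recent.append(obs)
        # High level: pick the next subtask and rewrite the text memory.
        subtask, memory = high_level(goal, memory, list(recent))
        # Low level: turn the subtask plus recent frames into an action.
        actions.append(low_level(subtask, list(recent)))
    return actions, memory


def toy_high_level(goal: str, memory: str, recent: List[str]) -> Tuple[str, str]:
    """Toy stand-in: tracks finished steps of a comma-separated recipe."""
    steps = goal.split(", ")
    done = [s for s in memory.split("; ") if s]
    latest = recent[-1]
    if latest.startswith("done:") and latest[5:] not in done:
        done.append(latest[5:])
    remaining = [s for s in steps if s not in done]
    subtask = remaining[0] if remaining else "idle"
    return subtask, "; ".join(done)


def toy_low_level(subtask: str, recent: List[str]) -> str:
    """Toy stand-in: reports the subtask and how many frames it saw."""
    return f"act({subtask}, frames={len(recent)})"


actions, memory = run_episode(
    "add salt, stir",
    ["start", "done:add salt", "stirring", "done:stir"],
    toy_high_level,
    toy_low_level,
)
```

Note that the text memory survives after old frames fall out of the bounded `deque`, which is the division of labor between the two memory scales.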

System Modules

Video Encoder (Input Processing)

Compresses dense sequence of recent images into a compact representation

Model or implementation: Modified ViT (Vision Transformer) with factorized spatial-temporal attention
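
A minimal NumPy sketch of one factorized spatial-temporal attention block follows. It is not the paper's implementation: the learned query/key/value projections are replaced by identity maps, the causal mask over time is an assumption, and the function names are hypothetical. It only illustrates the idea of attending within frames, then across frames, then keeping the current frame's tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, mask=None):
    """Scaled dot-product attention, batched over the leading axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def factorized_block(tokens):
    """tokens has shape (T, N, D): frames x patches x channels."""
    T = tokens.shape[0]
    # Spatial step: each frame attends over its own N patches.
    x = tokens + attention(tokens, tokens, tokens)
    # Temporal step: each patch position attends over frames.
    xt = np.swapaxes(x, 0, 1)                      # (N, T, D)
    causal = np.tril(np.ones((T, T), dtype=bool))  # assumed: no future frames
    xt = xt + attention(xt, xt, xt, mask=causal)
    return np.swapaxes(xt, 0, 1)                   # back to (T, N, D)


def encode_video(tokens):
    """Fuse temporal information, then keep only the latest frame's tokens."""
    return factorized_block(tokens)[-1]            # (N, D)


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 9, 8))  # 4 frames, 9 patches, 8 channels
compact = encode_video(frames)       # same token count as a single image
```

The output has N tokens regardless of T, so the downstream backbone sees the same sequence length whether it is given one frame or many.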

Proprioception Projector (Input Processing)

Encodes robot joint states into embedding space

Model or implementation: Linear Projection Layer

High-Level Policy (π_HL)

Predicts next subtask instruction and updates the semantic language memory

Model or implementation: VLM Backbone (Gemma3-4B base)

Low-Level Policy (π_LL)

Generates concrete robot actions

Model or implementation: VLM Backbone + Action Expert (Flow-matching + Discrete FAST)
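
To give a feel for the flow-matching half of the action expert, the sketch below integrates a velocity field from noise to an action with Euler steps. The velocity field here is a hand-written toy that points at a single known target; in a real system it would be a learned network conditioned on observations. All names are hypothetical.

```python
import random
from typing import List


def toy_velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Ideal field for a straight-line path ending at one target point.

    Stand-in for a learned, observation-conditioned network.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(target: List[float], steps: int = 10, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                              # t stays strictly below 1
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x


action = sample_action([0.5, -0.2, 1.0])
```

With this toy field the samples land on the target up to floating-point error; a learned field would instead map noise to a distribution over plausible actions.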

Novel Architectural Elements

Factorized spatial-temporal attention in the ViT that drops past-frame tokens after fusing temporal information, so the encoder outputs the same number of tokens as a single image
Recursive language memory update mechanism where the policy predicts its own next memory state based on previous memory and observations
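
One way to write the recursive update described above is given below. The symbols m_t (language memory) and \ell_t (subtask instruction) are notation introduced here for illustration, and the exact conditioning of each policy is an assumption rather than something this summary states.

```latex
(\ell_t,\; m_t) \sim \pi_{HL}\left(\cdot \mid g,\; m_{t-1},\; o_{t-T:t}\right),
\qquad
a_{t:t+H} \sim \pi_{LL}\left(\cdot \mid \ell_t,\; o_{t-T:t}\right)
```

Because m_t is regenerated from m_{t-1} at each update instead of being appended to a raw log, the text the policy reads need not grow with episode length.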

Modeling

Base Model: π0.6 (initialized from Gemma3-4B VLM)

Training Method: Supervised fine-tuning / Imitation Learning

Training Data:

Teleoperated robot demonstrations
Policy rollout data with human corrections
Vision-language tasks
Video-language tasks (e.g., video captioning)
Generated memory update data using an off-the-shelf LLM to summarize subtask history

Key Hyperparameters:

action_expert_parameters: 860M
input_resolution: 448x448 px
camera_streams: Up to 4
short_term_memory_horizon: Up to 18 frames / 54 seconds (during post-training)
pre_training_horizon: 6 observations (stride 1s)

Compute: Inference runs within a latency budget of hundreds of milliseconds (exact hardware not specified)

Comparison to Prior Work

vs. Frame Stacking: MEM uses factorized attention to compress video, avoiding the quadratic compute cost of raw frame stacking
vs. Keyframe Memory: MEM maintains dense video information (via the encoder) allowing for dynamics estimation, which sparse keyframes miss
vs. Language-Only Memory: MEM keeps visual context for immediate spatial corrections (re-grasps) while using language for long-term semantic tracking
vs. RAI (Robot AI) [not cited in paper]: RAI uses retrieval-augmented generation for memory; MEM uses a recurrent summary update mechanism instead of retrieval
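
The frame-stacking comparison can be made concrete with a rough count of pairwise attention scores per layer. This is back-of-the-envelope arithmetic, not a measurement: the patch count of 1024 assumes a hypothetical 14-pixel patch size on 448x448 input, which is not given in this summary, and the function names are illustrative.

```python
def attention_pairs_joint(frames: int, patches: int) -> int:
    """All tokens from all frames attend to each other."""
    n = frames * patches
    return n * n


def attention_pairs_factorized(frames: int, patches: int) -> int:
    """Spatial attention within each frame, then temporal attention per patch."""
    spatial = frames * patches * patches
    temporal = patches * frames * frames
    return spatial + temporal


FRAMES = 18     # short-term memory horizon from the hyperparameters
PATCHES = 1024  # assumed: (448 / 14) ** 2 with a hypothetical patch size of 14

joint = attention_pairs_joint(FRAMES, PATCHES)
factorized = attention_pairs_factorized(FRAMES, PATCHES)
print(f"joint: {joint:,}  factorized: {factorized:,}  ratio: {joint / factorized:.1f}x")
```

The joint count grows with the square of the number of frames, while the factorized count is dominated by a term that grows linearly in the number of frames when patches far outnumber frames.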

Limitations

Depends on the quality of the off-the-shelf LLM used to generate ground-truth memory summaries for training
Video encoder requires modifying the attention pattern of pre-trained ViTs, which might affect transfer learning stability
Explicit split into high-level and low-level policies introduces complexity in training data generation (subtask labeling)

Reproducibility

No code release is reported. The model builds on π0.6 and Gemma3-4B. Training data includes proprietary robot demonstrations and generated summaries.

📊 Experiments & Results

Evaluation Setup

Real-world robotic manipulation tasks varying in horizon length and complexity

Benchmarks:

Kitchen Cleanup: long-horizon sequential manipulation, up to 15 minutes [New]
Grilled Cheese Preparation: long-horizon cooking task [New]
Generalist Manipulation Tasks: diverse short-horizon tasks (pick-place, articulation)

Metrics:

Success Rate
Task Completion Time
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Kitchen Cleanup / Grilled Cheese	Task Horizon	Minutes (typical)	15 minutes	Significantly longer

Main Takeaways

MEM enables policies to solve very long-horizon tasks (up to 15 minutes) that are intractable for standard dense-context models.
The combination of modalities is critical: video memory handles short-term occlusions/dynamics, while language memory handles long-term task progress.
The video encoder architecture allows scaling to 54 seconds of dense visual history without the prohibitive latency of naive frame stacking.

📚 Prerequisite Knowledge

Prerequisites

Vision Language Action Models (VLAs)
Transformer architecture (Attention mechanisms)
Hierarchical Reinforcement Learning/Control

Key Terms

VLA: Vision Language Action model—a unified neural network that takes vision and language inputs and directly outputs robot actions

ViT: Vision Transformer—a model architecture that processes images as sequences of patches using attention mechanisms

proprioceptive state: Internal sensing of the robot's own body, such as joint angles or gripper position

RTC: Real-Time Chunking—an inference strategy where the robot predicts a chunk of future actions while simultaneously executing the previous chunk to maintain smooth motion

spatial-temporal attention: An attention mechanism that handles space (patches within a frame) and time (patches across frames) in separate steps to save computation

flow-matching: A generative modeling technique used to predict continuous distributions (like robot actions) by learning vector fields

LLM: Large Language Model—a generic text-processing AI model

VLM: Vision-Language Model—an AI model trained on both images and text

token: The basic unit of data processed by a Transformer (e.g., a word part or an image patch)

inference latency: The time delay between receiving an input and generating a response