MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

📝 Paper Summary

Robotic Manipulation · Vision-Language-Action (VLA) Models · Memory Systems in Robotics

MemoryVLA integrates a dual-stream memory bank (storing both fine-grained visual details and high-level semantics) into a diffusion-based control policy to enable robots to solve long-horizon, non-Markovian tasks.

Core Problem

Mainstream VLA models condition on a single current frame, so they cannot capture the temporal dependencies of non-Markovian tasks, where the correct next action depends on earlier events that the current observation no longer reveals.

Why it matters:

Many manipulation tasks are visually ambiguous without history (e.g., a button looks the same before and after pressing), leading to execution failures.
Naive solutions like concatenating frame histories are computationally expensive (quadratic attention complexity) and misaligned with single-frame pretraining paradigms.
Current state-of-the-art models like OpenVLA and pi0 struggle significantly with long-horizon tasks due to this lack of explicit temporal modeling.

Concrete Example: In a 'Push Buttons' task (Fig. 1a), the visual observation of a button is identical before and after the push. Without memory of the 'push' action occurring, a standard VLA cannot determine if the sub-goal is complete, leading it to repeat the action or stall.
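The ambiguity can be made concrete with a toy sketch. Everything in it (the observation string, the action names, and both policies) is hypothetical and only illustrates why an identical frame forces a memoryless policy to repeat itself; it is not the paper's task implementation.

```python
# Toy setting: two buttons must each be pressed once, and a button
# looks the same before and after it is pressed.
OBSERVATION = "two_buttons_visible"  # identical at every timestep


def memoryless_policy(observation):
    # Sees only the current frame, so the same frame always maps to the same action.
    return "press_button_A"


def memory_policy(observation, history):
    # Consults what has already been done before choosing the next action.
    for action in ("press_button_A", "press_button_B"):
        if action not in history:
            return action
    return "done"


def rollout(policy_step, steps=3):
    # Run a policy for a few steps, recording the actions it takes.
    history = []
    for _ in range(steps):
        history.append(policy_step(OBSERVATION, history))
    return history


memoryless_trace = rollout(lambda obs, hist: memoryless_policy(obs))
memory_trace = rollout(memory_policy)
```

The memoryless rollout presses the same button on every step, while the rollout with history moves on to the second button and then stops.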

Key Novelty

Perceptual-Cognitive Memory Bank (PCMB) with Consolidation

Maintains a dual-stream external memory: 'Perceptual' stream for fine-grained visual details and 'Cognitive' stream for high-level semantic summaries derived from a VLM.
Mimics biological memory consolidation by merging temporally adjacent and semantically similar entries when the buffer is full, preserving essential history without memory bloat.
Uses a 'Working Memory' (current frame tokens) to query the long-term PCMB, retrieving and fusing only decision-relevant historical context via a gating mechanism.

Architecture

The end-to-end framework of MemoryVLA, detailing the flow from observation to action via the memory bank.

Evaluation Highlights

+26 percentage points over CogACT on real-world long-horizon temporal tasks (83% vs 57%), demonstrating stronger temporal reasoning.
Achieves 96.5% success rate on the LIBERO simulation benchmark, outperforming both CogACT and pi0.
Outperforms pi0 by +11.8 points on the challenging Mikasa-Robo benchmark (41.2% vs 29.4%).

Breakthrough Assessment

8/10

Strong conceptual novelty in applying cognitive science memory theories (dual-store, consolidation) to VLA architectures. Demonstrates significant empirical gains on long-horizon tasks against top-tier baselines like pi0.

⚙️ Technical Details

Problem Definition

Setting: End-to-end robotic manipulation via imitation learning

Inputs: Current RGB image I and language instruction L

Outputs: Sequence of future actions {a_1, ..., a_T} (7-DoF: translation, rotation, gripper)
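As a rough illustration of the output format, the sketch below lays out one action chunk as a list of 7-dimensional vectors. The class and helper names are hypothetical, the rotation parameterization is an assumption, and the horizon of 16 is taken from the hyperparameter list in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    translation: Tuple[float, float, float]  # 3 translation components
    rotation: Tuple[float, float, float]     # 3 rotation components (parameterization assumed)
    gripper: float                           # 1 gripper state

    def as_vector(self) -> List[float]:
        # Flatten to the 7-DoF vector the policy predicts per step.
        return [*self.translation, *self.rotation, self.gripper]


def empty_chunk(horizon: int) -> List[Action]:
    # Placeholder chunk of `horizon` zero actions, just to show the layout.
    return [Action((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0) for _ in range(horizon)]


chunk = empty_chunk(horizon=16)
chunk_shape = (len(chunk), len(chunk[0].as_vector()))  # (T, 7)
```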

Pipeline Flow

Input Processing: Vision/Text Encoders → Working Memory Construction
Memory Operations: Retrieval from PCMB → Gated Fusion → Consolidation
Action Generation: Memory-Conditioned Diffusion Expert → Action Sequence

System Modules

Vision Encoder (Input Processing)

Extract visual features from current observation

Model or implementation: DINOv2 + SigLIP (concatenated)

Cognition Encoder (Input Processing)

Generate high-level semantic summary (cognitive token)

Model or implementation: LLaMA-7B (part of Prismatic VLM)

Perceptual Compression (Input Processing)

Compress raw visual tokens into compact perceptual working memory

Model or implementation: SE-bottleneck (Squeeze-and-Excitation)

Memory Bank (PCMB) (Memory Operations)

Store historical perceptual and cognitive tokens with positional encoding

Model or implementation: Dual-stream buffer (Perceptual stream + Cognitive stream)
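A minimal sketch of such a buffer follows. It is simplified in ways the summary does not confirm: each timestep contributes a single vector per stream (the real perceptual stream holds many tokens per step), the positional encoding is assumed to be sinusoidal over the timestep, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


def timestep_encoding(t: int, dim: int) -> List[float]:
    """Sinusoidal encoding of timestep t (an assumed choice for this sketch)."""
    enc = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
    return enc


@dataclass
class DualStreamBank:
    dim: int
    perceptual: List[List[float]] = field(default_factory=list)  # fine-grained visual stream
    cognitive: List[List[float]] = field(default_factory=list)   # high-level semantic stream

    def write(self, t: int, perceptual_token: List[float], cognitive_token: List[float]) -> None:
        # Tag both entries with the same timestep encoding before storing them.
        pe = timestep_encoding(t, self.dim)
        self.perceptual.append([x + p for x, p in zip(perceptual_token, pe)])
        self.cognitive.append([x + p for x, p in zip(cognitive_token, pe)])


bank = DualStreamBank(dim=4)
bank.write(0, [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
bank.write(1, [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])
```

Keeping the two streams as parallel lists means entry i in each stream refers to the same timestep, which is what lets retrieval and consolidation operate per stream.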

Memory Retrieval & Fusion (Memory Operations)

Retrieve relevant history using working memory as query and fuse with current state

Model or implementation: Transformer Attention + Learned Gating (Sigmoid)
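The retrieval and fusion step can be sketched as single-head scaled dot-product attention followed by an elementwise sigmoid gate. This is an illustration under assumptions: the function names are hypothetical, keys and values are taken to be the memory entries themselves, and the gate logits are passed in directly, whereas in a trained model they would come from a learned layer over the current and retrieved features.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query: List[float], memory: List[List[float]]) -> List[float]:
    # Attention with the working-memory vector as query and memory entries as keys/values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, entry)) / math.sqrt(d) for entry in memory]
    weights = softmax(scores)
    return [sum(w * entry[i] for w, entry in zip(weights, memory)) for i in range(d)]


def gated_fusion(current: List[float], retrieved: List[float], gate_logits: List[float]) -> List[float]:
    # Sigmoid gate in (0, 1) blends retrieved history with the current representation.
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * r + (1.0 - g) * c for g, r, c in zip(gates, retrieved, current)]


memory = [[1.0, 0.0], [0.0, 1.0]]
current = [2.0, 0.0]
retrieved = retrieve(current, memory)
fused = gated_fusion(current, retrieved, gate_logits=[0.0, 0.0])  # gate of 0.5 everywhere
```

A gate near 0 keeps the current frame's features and a gate near 1 defers to history, so the policy can ignore memory when the present observation is already sufficient.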

Action Expert

Generate continuous action trajectory conditioned on fused memory

Model or implementation: Diffusion Transformer (DiT) with ~300M parameters

Novel Architectural Elements

Dual-stream Perceptual-Cognitive Memory Bank explicitly separating visual details from semantic gist
Memory consolidation mechanism that merges tokens based on cosine similarity to maintain fixed memory size
Memory-conditioned diffusion architecture where perceptual and cognitive streams are injected via separate attention layers
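The consolidation idea can be sketched as follows: when the buffer exceeds its capacity, the most similar pair of temporally adjacent entries is merged. Merging by simple averaging and the function names are assumptions of this sketch, and entries are assumed to be non-zero vectors so that cosine similarity is defined.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consolidate(entries: List[List[float]], capacity: int) -> List[List[float]]:
    """Merge the most similar adjacent pair until the buffer fits within capacity."""
    entries = [list(e) for e in entries]  # copy so the caller's list is untouched
    while len(entries) > max(capacity, 1):
        # Similarity is only computed between temporal neighbours.
        sims = [cosine(entries[i], entries[i + 1]) for i in range(len(entries) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(a + b) / 2.0 for a, b in zip(entries[i], entries[i + 1])]
        entries[i:i + 2] = [merged]  # replace the pair with its average
    return entries


history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
compact = consolidate(history, capacity=2)
```

In this example the two identical trailing entries are merged, while the distinct first entry survives, which is the intended effect of dropping redundancy while keeping distinct events.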

Modeling

Base Model: Prismatic VLM (7B parameters)

Training Method: End-to-end Imitation Learning with Diffusion Loss

Objective Functions:

Purpose: Minimize difference between predicted and expert action trajectories.

Formally: MSE loss on diffusion noise prediction
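Written out in the standard noise-prediction form, this objective would look like the following. The notation is an assumption of this summary rather than the paper's own: a^{(0)} is the expert action sequence, k the diffusion step, \bar{\alpha}_k the cumulative noise schedule, c the memory-conditioned context, and \epsilon_\theta the action expert.

```latex
a^{(k)} = \sqrt{\bar{\alpha}_k}\, a^{(0)} + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{a^{(0)},\, \epsilon,\, k}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(a^{(k)}, k, c\right) \right\rVert_2^2 \right]
```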

Training Data:

SimplerEnv: Bridge v2 (50k steps), RT-1 dataset (80k steps)
Open-X Embodiment dataset (pretraining)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 256
global_batch_size: 256
optimizer: AdamW (implied)
inference_steps: 10 (DDIM)
guidance_scale: 1.5 (CFG)
action_horizon_T: 16
perceptual_tokens_N: 256

Compute: Training: 8 NVIDIA A100 GPUs. Inference: Single RGB frame input.

Comparison to Prior Work

vs. OpenVLA/pi0: MemoryVLA explicitly models temporal history via PCMB, whereas baselines are Markovian.
vs. RoboFlamingo: MemoryVLA stores fine-grained perceptual tokens alongside semantics, while RoboFlamingo compresses history into a single coarse latent.
vs. UniVLA: MemoryVLA uses visual-semantic memory retrieval, whereas UniVLA relies only on text-based action history.
vs. TraceVLA [not cited in paper]: TraceVLA paints history on current frames (visual modification), while MemoryVLA uses latent memory retrieval.
vs. Octo: Octo interleaves history in the context window (quadratic cost), while MemoryVLA uses a fixed-size consolidated memory bank.

Limitations

Relies on pretrained VLMs (Prismatic 7B), inheriting their biases and computational cost.
Memory bank capacity (L) and consolidation strategy introduce hyperparameters that may need tuning per task.
The consolidation mechanism merges entries based on similarity, which might occasionally discard subtle but critical temporal transitions.

Reproducibility

Code: https://shihao1895.github.io/MemoryVLA

Project page provided (https://shihao1895.github.io/MemoryVLA). The paper details architecture and hyperparameters (batch size, LR) clearly. Pretrained backbones (Prismatic, DINOv2, SigLIP) are open source.

📊 Experiments & Results

Evaluation Setup

Robotic manipulation in simulation (SimplerEnv, LIBERO, Mikasa-Robo) and the real world (Franka, WidowX).

Benchmarks:

SimplerEnv-Bridge: simulation, WidowX robot
SimplerEnv-Fractal: simulation, Google robot (Visual Matching & Aggregation)
LIBERO: long-horizon simulation tasks
Mikasa-Robo: simulation benchmark
Real-World Suite: physical robot manipulation, general and long-horizon tasks [New]

Metrics:

Success Rate (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Simulation results across multiple standard benchmarks showing MemoryVLA consistently outperforming SOTA baselines.

Benchmark	Metric	Baseline	This Paper	Δ
SimplerEnv-Bridge	Success Rate (%)	57.3	71.9	+14.6
SimplerEnv-Fractal	Success Rate (%)	68.1	72.7	+4.6
Mikasa-Robo	Success Rate (%)	29.4	41.2	+11.8
LIBERO (Average)	Success Rate (%)	Not reported in the paper	96.5	Not reported in the paper

Real-world experiments highlight the specific advantage of MemoryVLA in long-horizon temporal tasks compared to general tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Real-World (Long-horizon Temporal)	Success Score	57.0	83.0	+26.0
Real-World (General)	Success Score	76.0	85.0	+9.0
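The Δ column is an absolute difference in points, not a relative improvement. The short check below recomputes it from the baseline and MemoryVLA numbers in the table and, for contrast, computes the relative gain on the long-horizon real-world row; the variable names are illustrative only.

```python
# (baseline, this paper) pairs copied from the table above
results = {
    "SimplerEnv-Bridge": (57.3, 71.9),
    "SimplerEnv-Fractal": (68.1, 72.7),
    "Mikasa-Robo": (29.4, 41.2),
    "Real-World (Long-horizon Temporal)": (57.0, 83.0),
    "Real-World (General)": (76.0, 85.0),
}

# Absolute gain in points, rounded to one decimal as in the table
deltas = {name: round(ours - base, 1) for name, (base, ours) in results.items()}

# Relative gain for the long-horizon real-world row, in percent of the baseline
base, ours = results["Real-World (Long-horizon Temporal)"]
relative_gain_pct = round(100 * (ours - base) / base, 1)
```

The long-horizon gain is 26 points in absolute terms, which corresponds to roughly a 45.6% relative improvement over the baseline score.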

Experiment Figures

Motivation for MemoryVLA using a 'Push Buttons' task example and a comparison of human vs. MemoryVLA memory systems.

Main Takeaways

MemoryVLA improves markedly on long-horizon tasks (up to +26 percentage points in the real world), supporting the importance of explicit temporal modeling.
The method generalizes well across different robot embodiments (Franka, WidowX, Google Robot) and simulation environments.
The dual-memory system (perceptual + cognitive) is crucial; relying on just one or simple concatenation (as in baselines) leads to lower success in non-Markovian scenarios.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Diffusion Models for Robotics
Transformer Attention Mechanisms

Key Terms

VLA: Vision-Language-Action models—systems that take visual and text inputs and directly output robotic control actions

Non-Markovian: Processes where the next state depends on the history of events, not just the current state

PCMB: Perceptual-Cognitive Memory Bank—the proposed module storing history in two streams (visual details and semantic gist)

Working Memory: In this paper, the representation of the current timestep (perceptual + cognitive tokens) used to query long-term history

DiT: Diffusion Transformer—a diffusion model architecture based on Transformers instead of U-Nets

DDIM: Denoising Diffusion Implicit Models—an efficient sampling algorithm for diffusion models

7-DoF: 7 Degrees of Freedom—robot control outputs comprising 3 translation, 3 rotation, and 1 gripper state

SigLIP: Sigmoid Loss for Language Image Pre-training—a contrastive vision-language model used here as a visual encoder backbone

SimplerEnv: A simulation environment for evaluating robotic manipulation policies (Bridge and Fractal suites)