
GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models

Md Selim Sarowar, Omer Tariq, Sungho Kim
Yeungnam University, Korea Advanced Institute of Science and Technology
arXiv (2026)
MM Reasoning Agent

📝 Paper Summary

Vision-Language-Action (VLA) Models · 3D Scene Representation · Robot Manipulation
GST-VLA improves robot manipulation precision by converting visual inputs into structured 3D Gaussian primitives and enforcing explicit intermediate spatial reasoning before action generation.
Core Problem
Standard VLA models rely on 2D patch tokens that carry no intrinsic geometry. Injecting scalar depth adds distance information but says nothing about surface orientation or confidence, and neither approach provides a mechanism for verifying the model's spatial understanding before it acts.
Why it matters:
  • Implicitly recovering 3D structure from 2D tokens degrades as task precision increases (e.g., millimeter-scale edge grasping)
  • Pixel-uniform depth tokens waste representational budget on background regions rather than task-relevant geometry
  • Current models collapse scene interpretation and action generation into a single black box, making the spatial reasoning pathway non-inspectable
Concrete Example: In an edge grasping task, a flat surface and a sharp edge at the same depth produce identical scalar depth values. A standard model cannot distinguish the local curvature needed to orient the gripper, whereas GST-VLA's covariance parameter explicitly encodes this surface orientation.
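The difference in this example can be seen in a toy calculation. The sketch below is illustrative only and is not the paper's tokenizer: the patch size, coordinates, and the `covariance` helper are assumptions made for the demonstration. It builds a flat patch and a ridge-shaped edge patch that share the same depth at the centre pixel, then shows that their 3D covariances differ.

```python
# Illustrative toy comparison; patch geometry and helper names are hypothetical.

def covariance(points):
    """Return the 3x3 population covariance of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
         for j in range(3)]
        for i in range(3)
    ]

# Two 5x5 patches of back-projected points (metres), centred on the same pixel.
coords = [-0.02, -0.01, 0.0, 0.01, 0.02]
flat_patch = [(x, y, 1.0) for x in coords for y in coords]
# A ridge along the y axis: depth grows away from x = 0 on both sides.
edge_patch = [(x, y, 1.0 + abs(x)) for x in coords for y in coords]

# The centre point (index 12 in the 5x5 grid) has the same depth in both patches.
centre_depth_flat = flat_patch[12][2]
centre_depth_edge = edge_patch[12][2]

flat_cov = covariance(flat_patch)
edge_cov = covariance(edge_patch)

print("centre depths:", centre_depth_flat, centre_depth_edge)
print("depth variance, flat:", flat_cov[2][2])
print("depth variance, edge:", edge_cov[2][2])
```

A scalar depth token at the centre pixel sees the same value in both cases, while the covariance has zero spread along z for the flat patch and a nonzero spread for the edge.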
Key Novelty
Gaussian Spatial Tokenizer (GST) & Depth-Aware Chain-of-Thought (DA-CoT)
  • Replaces scalar depth pixels with anisotropic 3D Gaussian tokens that explicitly encode position, surface orientation (via covariance), and geometric confidence (via opacity)
  • Introduces a supervised intermediate reasoning stage where the model must generate explicit 3D thoughts (e.g., object centroids, grasp points) before generating action tokens
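A minimal sketch of what one such token might hold is shown below. The summary only states that a token encodes position, covariance, and opacity. The scale-plus-quaternion parameterization (covariance = R diag(s)^2 R^T) is borrowed from common 3D Gaussian splatting practice and is an assumption here, as are the class name `GaussianToken` and the `normal` helper.

```python
import math
from dataclasses import dataclass

@dataclass
class GaussianToken:
    """Hypothetical container for one anisotropic 3D Gaussian token."""
    mean: tuple      # (x, y, z) position
    scale: tuple     # (sx, sy, sz) standard deviations along local axes
    rotation: tuple  # unit quaternion (w, x, y, z), an assumed parameterization
    opacity: float   # in [0, 1], read here as geometric confidence

    def rotation_matrix(self):
        """Convert the unit quaternion to a 3x3 rotation matrix."""
        w, x, y, z = self.rotation
        return [
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
        ]

    def covariance(self):
        """Build Sigma = R diag(scale)^2 R^T."""
        r = self.rotation_matrix()
        s2 = [s * s for s in self.scale]
        return [
            [sum(r[i][k] * s2[k] * r[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)
        ]

    def normal(self):
        """Return the world-space axis with the smallest scale.

        For a flattened Gaussian lying on a surface, this thin axis
        approximates the surface normal.
        """
        k = min(range(3), key=lambda i: self.scale[i])
        r = self.rotation_matrix()
        return tuple(r[i][k] for i in range(3))

# A thin disc lying in the xy-plane: its normal points along z.
flat_token = GaussianToken(
    mean=(0.0, 0.0, 1.0),
    scale=(0.02, 0.02, 0.001),
    rotation=(1.0, 0.0, 0.0, 0.0),
    opacity=0.9,
)

# The same disc rotated 90 degrees about the x axis: its normal now lies along y.
h = math.sqrt(0.5)
tilted_token = GaussianToken(
    mean=(0.0, 0.0, 1.0),
    scale=(0.02, 0.02, 0.001),
    rotation=(h, h, 0.0, 0.0),
    opacity=0.9,
)

print(flat_token.normal())
print(tilted_token.normal())
```

Both tokens sit at the same position and depth, yet their covariances differ, which is the extra orientation signal a scalar depth value cannot carry.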
Evaluation Highlights
  • Achieves a 96.4% success rate on the LIBERO benchmark (+2.0% over state-of-the-art)
  • Achieves an 80.2% success rate on SimplerEnv (+5.4% over state-of-the-art)
  • Ablation confirms 3D Fourier positional encodings contribute significantly to performance (removing them costs 2.8 percentage points)
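For readers unfamiliar with Fourier positional encodings, the sketch below shows one common form applied to a 3D position. The summary does not give the paper's frequencies or band count, so the power-of-two frequency schedule, the default `num_bands=4`, and the function name `fourier_encode_3d` are all assumptions.

```python
import math

def fourier_encode_3d(position, num_bands=4):
    """Encode (x, y, z) as sin/cos features at frequencies 2^k * pi.

    Returns a flat list of length 3 * 2 * num_bands, ordered by
    coordinate, then frequency band, then (sin, cos).
    """
    features = []
    for coord in position:
        for k in range(num_bands):
            freq = (2 ** k) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features

origin_encoding = fourier_encode_3d((0.0, 0.0, 0.0))
point_encoding = fourier_encode_3d((0.10, -0.05, 0.80))
print(len(point_encoding))
```

The multi-frequency features let a downstream network tell apart positions that differ by small offsets, which is one plausible reason such encodings matter for precise manipulation.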
Breakthrough Assessment
8/10
A strong methodological contribution: it integrates explicit 3D Gaussian priors into the VLM token space, addressing the lack of geometric structure in standard VLAs.