Riccardo Bovo, Steven Abreu, Karan Ahuja, Eric J. Gonzalez, Li-Te Cheng, Mar González-Franco
IEEE Conference on Virtual Reality and 3D User Interfaces
(2024)
MemoryAgentMMSpeech
📝 Paper Summary
Memory organizationAgentic AI
EmBARDiment filters visual context in XR using eye-gaze fixations to build a concise episodic memory, enabling agents to answer implicit queries about read text without processing entire screen contents.
Core Problem
In XR environments, users have difficulty providing context to AI agents because typing is cumbersome and speech is low-bandwidth, while dumping all visible screen text into the context window creates noise and latency.
Why it matters:
XR headsets offer rich sensor data (eye tracking) that current chatbots ignore, relying instead on explicit, repetitive voice prompts
Providing all text from multiple productivity windows to an LLM is computationally heavy and dilutes relevance, making it hard to maintain nuanced conversations
Current explicit input modalities (text/speech) in XR are inefficient for complex knowledge work compared to natural implicit signaling
Concrete Example:If a user has multiple windows open and asks 'summarize this', a standard agent doesn't know which 'this' refers to. Dumping all windows into the context is slow and confusing. EmBARDiment uses gaze history to identify the specific paragraph the user just read.
Key Novelty
Gaze-Driven Contextual Memory
Uses real-time eye tracking to detect what text the user is reading (fixations >120ms) and stores only that text in a short-term memory buffer
Automatically injects this 'read' text as context into the LLM prompt when the user speaks, establishing a shared theory of mind without explicit selection
Combines this implicit context with an embodied avatar that uses visemes (lip sync) to provide grounded, naturalistic responses
Architecture
System architecture connecting user inputs (speech, gaze) to the contextual memory and LLM
Breakthrough Assessment
5/10
Proposes a logical integration of XR sensors with LLM context windows for productivity. While the idea of gaze-for-context is established, the specific implementation for continuous episodic memory in LLMs is a solid application engineering contribution.
⚙️ Technical Details
Problem Definition
Setting: Multi-window XR productivity environment where a user reads text and verbally queries an agent
Inputs: User speech (audio), Eye-gaze vectors, Visual frames of open application windows
Outputs: Spoken answer from the agent, animated avatar facial expressions (visemes)
Pipeline Flow
Screen Capture & OCR (Extract text from windows)
Gaze Filtering (Select text based on eye fixation)
Memory Update (Push text to FIFO buffer)
Query Processing (Combine speech + memory -> LLM)
Embodiment (LLM Response -> TTS + Animation)
System Modules
WindowMirror / OCR
Captures PC windows into XR and extracts text and bounding boxes
Model or implementation: Google Vision API
Gaze-Driven Contextual Memory
Filters visible text to only include what the user actually read
Model or implementation: Heuristic Fixation Logic (Threshold > 120ms)
LLM Agent
Generates the response based on the user's query and the gaze-selected context
Model or implementation: ChatGPT-4 (API)
Embodiment Engine
Converts text response to speech and animates the avatar
Model or implementation: Google Cloud Text-to-Speech API
Novel Architectural Elements
Gaze-driven episodic memory buffer: A FIFO queue that specifically stores OCR'd text intersected by gaze fixations (>120ms) to serve as implicit LLM context
Modeling
Base Model: ChatGPT-4 (via API)
Compute: Inference only (uses external APIs: Google Vision, Google Speech-to-Text, OpenAI API)
Comparison to Prior Work
vs. Nimble/MiseUnseen: EmBARDiment focuses on productivity/reading context via temporal memory (episodic buffer) rather than immediate directional pointing or spatial arrangement
vs. Standard Chatbots (ChatGPT/Claude): Introduces implicit visual context inputs (gaze) rather than relying solely on explicit text/image uploads
Limitations
Dependency on external APIs (Google, OpenAI) introduces latency
Contextual memory is limited to a small buffer (250 words) and clears after every request
Relies on accuracy of OCR and Eye-tracking calibration
Evaluation results not present in the provided text snippet
Code available at https://emBARDiment.github.io. Uses commercial APIs (Google Cloud, OpenAI) which may require keys/payment to replicate.
📊 Experiments & Results
Evaluation Setup
User study with reading comprehension tasks in a multi-window XR environment
Benchmarks:
Custom Reading Task (Question Answering based on 3 texts (Quantum Computing themes)) [New]
Metrics:
Implicit feedback (HLMIQ survey)
Number of attempts to get correct answer
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
The paper describes a study design comparing three conditions: Baseline (no context), Full Context (all window text), and Eye-Tracking (gaze-selected text).
The study aims to evaluate if gaze-driven context reduces the need for explicit prompt engineering and makes interaction more natural.
Note: The provided text for this summary ends at Section 3.2 (Design), so no quantitative results or findings are available to report.
📚 Prerequisite Knowledge
Prerequisites
Basic understanding of Extended Reality (XR) and eye-tracking
Familiarity with LLM prompting and context windows
Knowledge of OCR (Optical Character Recognition) pipelines
Key Terms
XR: Extended Reality—an umbrella term for virtual, augmented, and mixed reality environments
OCR: Optical Character Recognition—technology that converts images of text (like screen captures) into machine-readable text formats
Visemes: Visual representations of phonemes; the shape the mouth makes when producing a specific sound, used for lip-syncing avatars
Saliency: The quality of being noticeable or important; here, determined by where the user's eyes are fixated
Fixation: A period where the eyes remain relatively still on a specific point (defined here as >120ms), allowing visual processing
LLM: Large Language Model—AI models designed to understand and generate human language