
EmBARDiment: an Embodied AI Agent for Productivity in XR

Riccardo Bovo, Steven Abreu, Karan Ahuja, Eric J. Gonzalez, Li-Te Cheng, Mar González-Franco
IEEE Conference on Virtual Reality and 3D User Interfaces (2024)
Memory Agent MM Speech

📝 Paper Summary

Memory organization, Agentic AI
EmBARDiment filters visual context in XR using eye-gaze fixations to build a concise episodic memory, enabling agents to answer implicit queries about read text without processing entire screen contents.
Core Problem
In XR environments, users have difficulty providing context to AI agents because typing is cumbersome and speech is low-bandwidth, while dumping all visible screen text into the context window creates noise and latency.
Why it matters:
  • XR headsets offer rich sensor data (eye tracking) that current chatbots ignore, relying instead on explicit, repetitive voice prompts
  • Providing all text from multiple productivity windows to an LLM is computationally heavy and dilutes relevance, making it hard to maintain nuanced conversations
  • Current explicit input modalities (text/speech) in XR are inefficient for complex knowledge work compared to natural implicit signaling
Concrete Example: If a user has multiple windows open and asks 'summarize this', a standard agent cannot tell what 'this' refers to. Passing the text of every window to the model is slow and buries the relevant passage. EmBARDiment instead uses gaze history to identify the specific paragraph the user just read.
Key Novelty
Gaze-Driven Contextual Memory
  • Uses real-time eye tracking to detect what text the user is reading (fixations longer than 120 ms) and stores only that text in a short-term memory buffer
  • Automatically injects this 'read' text as context into the LLM prompt when the user speaks, establishing a shared theory of mind without explicit selection
  • Combines this implicit context with an embodied avatar that uses visemes (lip sync) to provide grounded, naturalistic responses
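The first two points above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `GazeSample`, `GazeMemory`, and `build_prompt`, the buffer capacity, the duplicate-skipping rule, and the prompt wording are all hypothetical choices made for illustration. Only the 120 ms fixation threshold comes from the summary.

```python
from collections import deque
from dataclasses import dataclass

# From the summary: only fixations longer than 120 ms count as "read" text.
FIXATION_THRESHOLD_MS = 120


@dataclass
class GazeSample:
    """One gaze event (hypothetical structure, not from the paper)."""
    text: str           # text chunk under the gaze point
    duration_ms: float  # how long the gaze rested on that chunk


class GazeMemory:
    """Short-term buffer that keeps only text the user fixated on."""

    def __init__(self, max_items: int = 50):  # capacity is an assumption
        # A bounded deque drops the oldest entry once it is full.
        self._items = deque(maxlen=max_items)

    def observe(self, sample: GazeSample) -> bool:
        """Store the sample's text if it qualifies; return True if stored."""
        if sample.duration_ms <= FIXATION_THRESHOLD_MS:
            return False  # a glance, not a reading fixation
        if self._items and self._items[-1] == sample.text:
            return False  # assumed rule: skip an immediate re-fixation
        self._items.append(sample.text)
        return True

    def context(self) -> str:
        """Return the remembered text in reading order."""
        return " ".join(self._items)


def build_prompt(memory: GazeMemory, spoken_query: str) -> str:
    """Prepend the gaze-derived context to the spoken request."""
    return (
        "Text the user recently read:\n"
        f"{memory.context()}\n\n"
        f"User request: {spoken_query}"
    )


if __name__ == "__main__":
    memory = GazeMemory()
    memory.observe(GazeSample("Q3 revenue rose 12%.", 300))  # stored
    memory.observe(GazeSample("sidebar ad", 80))             # ignored
    print(build_prompt(memory, "summarize this"))
```

In this sketch the prompt contains only the fixated sentence, so a request such as 'summarize this' reaches the model with a small, relevant context instead of the text of every open window.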
Architecture
Architecture Figure: Figure 1 (referenced in text)
System architecture connecting user inputs (speech, gaze) to the contextual memory and LLM
Breakthrough Assessment
5/10
Proposes a logical integration of XR sensors with LLM context windows for productivity. While the idea of gaze-for-context is established, the specific implementation for continuous episodic memory in LLMs is a solid application engineering contribution.