EmBARDiment: an Embodied AI Agent for Productivity in XR

📝 Paper Summary

Memory organization, Agentic AI

EmBARDiment filters visual context in XR using eye-gaze fixations to build a concise episodic memory, enabling agents to answer implicit queries about read text without processing entire screen contents.

Core Problem

In XR environments, users have difficulty providing context to AI agents because typing is cumbersome and speech is low-bandwidth, while dumping all visible screen text into the context window creates noise and latency.

Why it matters:

XR headsets offer rich sensor data (eye tracking) that current chatbots ignore, relying instead on explicit, repetitive voice prompts
Providing all text from multiple productivity windows to an LLM is computationally heavy and dilutes relevance, making it hard to maintain nuanced conversations
Current explicit input modalities (text/speech) in XR are inefficient for complex knowledge work compared to natural implicit signaling

Concrete Example: If a user has multiple windows open and asks 'summarize this', a standard agent cannot tell what 'this' refers to. Dumping all windows into the context is slow and adds irrelevant text. EmBARDiment uses gaze history to identify the specific paragraph the user just read.

Key Novelty

Gaze-Driven Contextual Memory

Uses real-time eye tracking to detect what text the user is reading (fixations >120ms) and stores only that text in a short-term memory buffer
Automatically injects this 'read' text as context into the LLM prompt when the user speaks, establishing a shared theory of mind without explicit selection
Combines this implicit context with an embodied avatar that uses visemes (lip sync) to provide grounded, naturalistic responses

Architecture

System architecture connecting user inputs (speech, gaze) to the contextual memory and LLM

Breakthrough Assessment

5/10

Proposes a logical integration of XR sensors with LLM context windows for productivity. While the idea of gaze-for-context is established, the specific implementation for continuous episodic memory in LLMs is a solid application engineering contribution.

⚙️ Technical Details

Problem Definition

Setting: Multi-window XR productivity environment where a user reads text and verbally queries an agent

Inputs: User speech (audio), Eye-gaze vectors, Visual frames of open application windows

Outputs: Spoken answer from the agent, animated avatar facial expressions (visemes)

Pipeline Flow

Screen Capture & OCR (Extract text from windows)
Gaze Filtering (Select text based on eye fixation)
Memory Update (Push text to FIFO buffer)
Query Processing (Combine speech + memory -> LLM)
Embodiment (LLM Response -> TTS + Animation)
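
The five stages above can be sketched as a single function that handles one conversational turn. This is a minimal illustration, not the authors' implementation: `run_turn`, `was_read`, `ask_llm`, and `speak` are hypothetical names, the OCR output is assumed to arrive as a list of words, and the LLM and TTS calls are passed in as plain callables so that no real API is assumed.

```python
from collections import deque
from typing import Callable, Deque, Iterable


def run_turn(
    ocr_words: Iterable[str],          # Stage 1 output: words extracted from the windows
    was_read: Callable[[str], bool],   # Stage 2: hypothetical gaze-fixation predicate
    memory: Deque[str],                # Stage 3: FIFO buffer of read words
    spoken_query: str,                 # transcribed user speech
    ask_llm: Callable[[str], str],     # Stage 4: stand-in for the LLM call
    speak: Callable[[str], None],      # Stage 5: stand-in for TTS + avatar animation
) -> str:
    # Stage 2: keep only the words the user fixated on.
    read_words = [w for w in ocr_words if was_read(w)]
    # Stage 3: push them into the FIFO buffer (oldest words fall out when it is full).
    memory.extend(read_words)
    # Stage 4: combine the buffered context with the spoken query.
    prompt = f"Context the user has read: {' '.join(memory)}\nUser: {spoken_query}"
    answer = ask_llm(prompt)
    # Stage 5: hand the answer to the embodiment layer.
    speak(answer)
    return answer


# Usage with stubs: an "LLM" that upper-cases the prompt and a "speaker" that records output.
memory: Deque[str] = deque(maxlen=5)
spoken: list = []
answer = run_turn(
    ["qubits", "are", "fragile", "menu"],
    lambda w: w != "menu",   # pretend the user never looked at "menu"
    memory,
    "Summarize this",
    lambda p: p.upper(),
    spoken.append,
)
```

Passing the external services in as callables keeps the sketch testable offline; in a real system each stub would wrap the corresponding cloud API.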

System Modules

WindowMirror / OCR

Captures PC windows into XR and extracts text and bounding boxes

Model or implementation: Google Vision API

Gaze-Driven Contextual Memory

Filters visible text to only include what the user actually read

Model or implementation: Heuristic Fixation Logic (Threshold > 120ms)
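
One way to picture the fixation heuristic is a dwell-time check against OCR bounding boxes. The sketch below assumes gaze samples arrive as `(timestamp_ms, x, y)` tuples in the same 2D coordinate space as the boxes; `WordBox` and `words_read` are hypothetical names, and only the 120 ms threshold comes from the summary. Production eye-tracking pipelines often use dispersion- or velocity-based fixation detection instead, so treat this as a simplification.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence, Tuple

FIXATION_THRESHOLD_MS = 120  # threshold stated in the summary


@dataclass(frozen=True)
class WordBox:
    """An OCR word with its bounding box (hypothetical structure)."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def words_read(
    samples: Iterable[Tuple[float, float, float]],
    boxes: Sequence[WordBox],
    threshold_ms: float = FIXATION_THRESHOLD_MS,
) -> List[str]:
    """Return words whose box held the gaze continuously for more than threshold_ms."""
    read: List[str] = []
    current: Optional[WordBox] = None
    dwell_start = 0.0
    emitted = False
    for t, x, y in samples:
        hit = next((b for b in boxes if b.contains(x, y)), None)
        if hit is not current:
            # Gaze moved to a different word (or off text): restart the dwell timer.
            current, dwell_start, emitted = hit, t, False
        elif hit is not None and not emitted and t - dwell_start > threshold_ms:
            # Dwell exceeded the threshold: count the word as read, once per dwell.
            read.append(hit.text)
            emitted = True
    return read


boxes = [WordBox("quantum", 0, 0, 10, 5), WordBox("computing", 12, 0, 25, 5)]
samples = [(0, 5, 2), (60, 5, 2), (130, 5, 2), (140, 15, 2), (200, 15, 2), (400, 50, 50)]
result = words_read(samples, boxes)  # only "quantum" is held for more than 120 ms
```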

LLM Agent

Generates the response based on the user's query and the gaze-selected context

Model or implementation: ChatGPT-4 (API)

Embodiment Engine

Converts text response to speech and animates the avatar

Model or implementation: Google Cloud Text-to-Speech API

Novel Architectural Elements

Gaze-driven episodic memory buffer: A FIFO queue that specifically stores OCR'd text intersected by gaze fixations (>120ms) to serve as implicit LLM context
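
A bounded FIFO of this kind can be sketched with `collections.deque`. The 250-word cap and the clear-after-request behaviour are taken from the Limitations section of this summary; the class name `GazeMemory` and its methods are hypothetical, and splitting on whitespace is an assumed way of counting words.

```python
from collections import deque
from typing import Deque


class GazeMemory:
    """Bounded FIFO of words the user has read (illustrative sketch)."""

    def __init__(self, max_words: int = 250) -> None:
        # deque(maxlen=...) silently discards the oldest items once the cap is reached.
        self._words: Deque[str] = deque(maxlen=max_words)

    def push(self, text: str) -> None:
        """Append gaze-selected text, word by word."""
        self._words.extend(text.split())

    def drain(self) -> str:
        """Return the buffered context for a request, then clear the buffer."""
        context = " ".join(self._words)
        self._words.clear()
        return context

    def __len__(self) -> int:
        return len(self._words)


mem = GazeMemory(max_words=3)
mem.push("a b")
mem.push("c d")          # "a" is evicted because the cap is 3 words
context = mem.drain()    # "b c d"; the buffer is empty afterwards
```

Capping by word count rather than by fixation count keeps the injected context size predictable regardless of how long individual fixated passages are.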

Modeling

Base Model: ChatGPT-4 (via API)

Compute: Inference only (uses external APIs: Google Vision, Google Speech-to-Text, OpenAI API)

Comparison to Prior Work

vs. Nimble/MiseUnseen: EmBARDiment focuses on productivity/reading context via temporal memory (episodic buffer) rather than immediate directional pointing or spatial arrangement
vs. Standard Chatbots (ChatGPT/Claude): Introduces implicit visual context inputs (gaze) rather than relying solely on explicit text/image uploads

Limitations

Dependency on external APIs (Google, OpenAI) introduces latency
Contextual memory is limited to a small buffer (250 words) and clears after every request
Relies on the accuracy of OCR and eye-tracking calibration
Evaluation results not present in the provided text snippet

Reproducibility

Code: https://emBARDiment.github.io

Uses commercial APIs (Google Cloud, OpenAI), which may require API keys and payment to replicate.

📊 Experiments & Results

Evaluation Setup

User study with reading comprehension tasks in a multi-window XR environment

Benchmarks:

Custom Reading Task [New]: question answering based on 3 texts on quantum computing themes

Metrics:

Implicit feedback (HLMIQ survey)
Number of attempts to get correct answer
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The paper describes a study design comparing three conditions: Baseline (no context), Full Context (all window text), and Eye-Tracking (gaze-selected text).
The study aims to evaluate whether gaze-driven context reduces the need for explicit prompt engineering and makes interaction more natural.
Note: The provided text for this summary ends at Section 3.2 (Design), so no quantitative results or findings are available to report.
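
The three study conditions differ only in what context accompanies the user's question. The sketch below shows one way that difference could be expressed; `Condition`, `build_prompt`, and the prompt template are hypothetical and are not taken from the paper.

```python
from enum import Enum


class Condition(Enum):
    BASELINE = "baseline"          # no context
    FULL_CONTEXT = "full_context"  # all window text
    EYE_TRACKING = "eye_tracking"  # gaze-selected text only


def build_prompt(condition: Condition, query: str, all_window_text: str, gaze_text: str) -> str:
    """Assemble the text sent to the LLM for a given study condition."""
    if condition is Condition.BASELINE:
        context = ""
    elif condition is Condition.FULL_CONTEXT:
        context = all_window_text
    else:
        context = gaze_text
    if not context:
        # Without context, the agent sees only the spoken question.
        return query
    return f"Context:\n{context}\n\nQuestion: {query}"


query = "What is a qubit?"
everything = "Window A text. Window B text. Window C text."
read_only = "Window B text."
prompts = {c: build_prompt(c, query, everything, read_only) for c in Condition}
```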

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Extended Reality (XR) and eye-tracking
Familiarity with LLM prompting and context windows
Knowledge of OCR (Optical Character Recognition) pipelines

Key Terms

XR: Extended Reality—an umbrella term for virtual, augmented, and mixed reality environments

OCR: Optical Character Recognition—technology that converts images of text (like screen captures) into machine-readable text formats

Visemes: Visual representations of phonemes; the shape the mouth makes when producing a specific sound, used for lip-syncing avatars

Saliency: The quality of being noticeable or important; here, determined by where the user's eyes are fixated

Fixation: A period where the eyes remain relatively still on a specific point (defined here as >120ms), allowing visual processing

LLM: Large Language Model—AI models designed to understand and generate human language