
VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Salesforce AI Research
arXiv, March 2026
Memory RAG Agent Speech

📝 Paper Summary

Modularized RAG pipeline Agentic RAG pipeline
VoiceAgentRAG decouples retrieval from generation using a background agent that predicts and prefetches follow-up topics into a local cache, enabling sub-millisecond retrieval for real-time voice conversations.
Core Problem
Standard RAG retrieval steps introduce 50–300ms of latency, which, when combined with speech-to-text and generation, exceeds the 200ms budget required for natural voice interactions.
Why it matters:
  • Voice-enabled AI agents require sub-200ms response times to feel natural; current RAG approaches break this flow due to network round-trips.
  • Existing solutions like faster indices (HNSW) don't eliminate network latency, and standard semantic caches are reactive (only helping on repeated queries), failing to address the dynamic nature of conversation.
  • Speculative retrieval methods typically operate within a single query's lifecycle rather than exploiting the thinking time between conversation turns.
Concrete Example: In a customer service voice call, if a user asks 'What are the pricing details?', a standard RAG system pauses for ~120ms to fetch data from a cloud vector DB. VoiceAgentRAG predicts this question while answering the *previous* turn, pre-loading the pricing chunks so the retrieval takes 0.35ms.
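The timing idea in this example can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a simulated remote store with a fixed ~120ms delay, a plain dictionary keyed by topic string as the cache, and a 0.5s pause standing in for the agent speaking and the user thinking. The names `remote_fetch`, `prefetch`, `retrieve`, and `conversation` are hypothetical.

```python
import asyncio
import time

REMOTE_LATENCY_S = 0.12  # simulated cloud vector DB round-trip (~120 ms, as in the example)
cache = {}               # assumption: exact-topic dict, a simplification of a semantic cache


async def remote_fetch(topic):
    """Stand-in for a slow remote retrieval call."""
    await asyncio.sleep(REMOTE_LATENCY_S)
    return f"chunks about {topic}"


async def prefetch(topic):
    """Background work: fetch a predicted topic and store it locally."""
    cache[topic] = await remote_fetch(topic)


async def retrieve(topic):
    """Foreground retrieval: local cache first, remote store on a miss."""
    start = time.perf_counter()
    if topic in cache:
        source = "cache"
    else:
        await remote_fetch(topic)
        source = "remote"
    elapsed_ms = (time.perf_counter() - start) * 1000
    return source, elapsed_ms


async def conversation():
    # Turn 1: nothing is cached, so retrieval pays the full round-trip.
    first = await retrieve("plans")
    # While the turn-1 answer is spoken, prefetch the predicted next topic.
    background = asyncio.create_task(prefetch("pricing"))
    await asyncio.sleep(0.5)  # stand-in for speaking and thinking time
    await background
    # Turn 2: the predicted topic is already local.
    second = await retrieve("pricing")
    return first, second


if __name__ == "__main__":
    for source, ms in asyncio.run(conversation()):
        print(f"{source}: {ms:.2f} ms")
```

The prefetch cost is not removed, only moved into time the conversation would spend idle anyway. If the prediction is wrong, the foreground path falls back to the remote call and pays the normal latency.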
Key Novelty
Dual-Agent Predictive Memory Router
  • Separates the system into a 'Fast Talker' (foreground, reads from cache) and a 'Slow Thinker' (background, predicts future topics).
  • Utilizes the user's listening and thinking time during the current turn to speculatively retrieve and cache documents for likely *next* turns.
  • Indexes the semantic cache by *document* embeddings rather than query embeddings to ensure retrieval accuracy even when the user's phrasing differs from the predicted query.
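A minimal sketch of the three points above, under loud assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, `REMOTE_DB` is an in-memory list standing in for a cloud vector database, the similarity threshold of 0.2 is arbitrary, and the Slow Thinker is called synchronously between turns rather than running in the background. None of these names or values come from the paper.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical knowledge base standing in for the remote vector store.
REMOTE_DB = [
    "Pricing plans start at 10 dollars per month for the basic tier.",
    "Refunds are issued within 14 days of purchase.",
    "Support is available by phone and email.",
]


def remote_search(query, k=1):
    """Stand-in for the slow remote similarity search."""
    q = embed(query)
    return sorted(REMOTE_DB, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


class DocumentCache:
    """Local cache indexed by the embeddings of the cached chunks themselves,
    not by the embedding of the predicted query that fetched them."""

    def __init__(self, threshold=0.2):  # threshold value is an assumption
        self.entries = []  # list of (chunk_embedding, chunk_text)
        self.threshold = threshold

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def lookup(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), chunk) for emb, chunk in self.entries]
        hits = [(s, c) for s, c in scored if s >= self.threshold]
        return [c for _, c in sorted(hits, reverse=True)]


def slow_thinker(cache, predicted_queries):
    """Background role: retrieve chunks for predicted follow-ups and cache them."""
    for predicted in predicted_queries:
        for chunk in remote_search(predicted):
            cache.add(chunk)


def fast_talker(cache, query):
    """Foreground role: read from the cache, fall back to remote on a miss."""
    hits = cache.lookup(query)
    if hits:
        return "cache", hits[0]
    return "remote", remote_search(query)[0]


if __name__ == "__main__":
    cache = DocumentCache()
    slow_thinker(cache, ["pricing plans and cost"])  # predicted follow-up
    print(fast_talker(cache, "What are the pricing details?"))
    print(fast_talker(cache, "How are refunds issued?"))
```

Because the lookup compares the user's actual query against the cached chunk embeddings, a hit depends on whether the chunk is relevant to what the user said, not on whether the user's wording matches the predicted query.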
Evaluation Highlights
  • 316x retrieval speedup on cache hits (110ms -> 0.35ms) compared to direct Qdrant Cloud queries.
  • 75% overall cache hit rate across 200 queries in 10 diverse conversation scenarios (79% on warm turns).
  • Saved 16.5 seconds of cumulative retrieval latency across 150 cache hits in the evaluation benchmark.
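As a quick consistency check, the reported figures can be recomputed from one another. The sketch below uses only the rounded numbers quoted above; the variable names are illustrative.

```python
TOTAL_QUERIES = 200
HIT_RATE = 0.75     # overall cache hit rate
REMOTE_MS = 110.0   # direct remote query latency (rounded, as reported)
CACHE_MS = 0.35     # cache-hit retrieval latency (rounded, as reported)

hits = round(TOTAL_QUERIES * HIT_RATE)
saved_s = hits * (REMOTE_MS - CACHE_MS) / 1000
speedup = REMOTE_MS / CACHE_MS

print(f"cache hits: {hits}")
print(f"latency saved: {saved_s:.2f} s")
print(f"speedup from rounded latencies: {speedup:.0f}x")
```

The hit count comes out to 150 and the cumulative saving to about 16.4 to 16.5 seconds, matching the reported values. The rounded latencies give a speedup of roughly 314x; the small gap to the reported 316x is presumably due to rounding in the quoted latencies.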
Breakthrough Assessment
8/10
Significant architectural innovation for the specific constraint of real-time voice. Effectively applies the 'System 1 / System 2' concept to system engineering, solving the network latency bottleneck for RAG.