Max Planck Institute for Informatics, Saarland Informatics Campus
arXiv, 5/2025
RAG, Memory, P13N, Benchmark, QA
📝 Paper Summary
Memory recall, Dense memory QA
ReQAP decomposes complex personal questions into recursive operator trees that combine SQL-like logic with neural extraction, enabling private, on-device answering over heterogeneous structured and unstructured data.
Core Problem
Personal data is heterogeneous (tables, text, logs) and massive (>100K tokens), making standard RAG fail on context limits and aggregations, while Text-to-SQL fails on unstructured text.
Why it matters:
Privacy requirements demand local processing on user devices, ruling out massive cloud-based LLMs for full context processing
Users need analytical answers (e.g., 'how often did I eat Italian after football?') which require joining structured logs (workouts) with unstructured text (social media) and precise aggregation
Current approaches force a tradeoff: verbalization (RAG) handles text but fails at aggregation; translation (Text-to-SQL) handles aggregation but fails on unstructured text
Concrete Example: For the question 'How often did I eat Italian food after playing football?', a standard SQL generator fails because 'Italian food' might only appear as 'pizza' in a text email body, while 'playing football' might be a calendar entry. ReQAP generates a tree that retrieves candidate events, extracts 'cuisine' from email text using a small LM, and then joins/counts the results.
Key Novelty
Recursive Question Understanding and Decomposition (ReQAP)
Recursive Decomposition: Instead of generating a full query at once, the model recursively breaks a complex question into an operator and a simpler sub-question, refining the tree step-by-step
Hybrid Operators: Introduces `RETRIEVE` (high-recall retrieval with cascade pruning) and `EXTRACT` (using small LMs to dynamically populate virtual columns from text) to bridge structured and unstructured data
Distillation for On-Device Use: Uses In-Context Learning (ICL) on large models to generate training data, then distills this into small (1B-7B) local models that can execute the logic privately
Architecture
The two-stage process of ReQAP: (1) Question Understanding & Decomposition (QUD) generating the tree, and (2) Operator Tree Execution (OTX) processing the data.
Evaluation Highlights
PerQA benchmark: Constructed a new dataset with 3,500 complex questions and >40,000 events per persona to test analytical reasoning
ReQAP is reported to substantially outperform standard Text-to-SQL baselines on complex aggregation tasks involving unstructured text; the specific figures are not available in the text summarized here
Efficiency: The pruning pipeline in `RETRIEVE` enables scanning massive personal archives by eliminating irrelevant sources (e.g., music streams) early
Breakthrough Assessment
8/10
Strong contribution to privacy-preserving personal QA. The recursive decomposition and hybrid operator tree elegantly solve the structured/unstructured gap that plagues standard RAG and Text-to-SQL.
⚙️ Technical Details
Problem Definition
Setting: Question answering over a temporally ordered list of events, where each event is a dictionary of key-value pairs (some structured, some unstructured text)
Inputs: Natural language question q and a heterogeneous collection of user data streams (calendar, emails, workouts, etc.)
Outputs: A traceable answer derived from executing an operator tree
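The setting above can be made concrete with a small data model. This is a minimal sketch, not the paper's schema: the `Event` class, its field names, and `build_timeline` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class Event:
    """One canonicalized event: a timestamp, its source, and key-value pairs.

    Hypothetical structure for illustration; values may be structured
    (numbers, dates) or unstructured free text.
    """
    timestamp: datetime
    source: str                      # e.g. "calendar", "mail", "workout"
    data: dict[str, Any] = field(default_factory=dict)


def build_timeline(events: list[Event]) -> list[Event]:
    """Return the events in temporal order, as the problem setting assumes."""
    return sorted(events, key=lambda e: e.timestamp)


timeline = build_timeline([
    Event(datetime(2024, 5, 4, 20, 0), "mail",
          {"subject": "Dinner", "body": "Pizza at Luigi's was great"}),
    Event(datetime(2024, 5, 4, 17, 0), "calendar",
          {"title": "Football with Sam", "duration_min": 90}),
])
```

Note that the mail event has no 'cuisine' key; that is the kind of gap the `EXTRACT` operator is meant to fill at query time.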
Pipeline Flow
Group 1 (Planning): Recursive Question Decomposition (LLM) → Operator Tree
Group 2 (Execution): Tree Traversal → Leaf Nodes (RETRIEVE) → Intermediate Nodes (EXTRACT/FILTER/JOIN) → Root (AGGREGATE) → Answer
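The execution group can be sketched as a bottom-up tree traversal. The code below is an illustrative toy under stated assumptions, not the paper's implementation: the `Node` class, the predicate-based `RETRIEVE`, the keyword lookup standing in for the small extraction LM, and the 6-hour window used to interpret 'after' are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any


@dataclass
class Node:
    op: str                               # RETRIEVE, EXTRACT, FILTER, JOIN, AGGREGATE
    arg: Any = None                       # operator-specific argument (assumed shape)
    children: list["Node"] = field(default_factory=list)


def execute(node: Node, events: list[dict]) -> Any:
    """Evaluate children first, then apply this node's operator to their results."""
    inputs = [execute(child, events) for child in node.children]
    if node.op == "RETRIEVE":             # leaf: a predicate stands in for neural retrieval
        return [e for e in events if node.arg(e)]
    if node.op == "EXTRACT":              # add a virtual column computed from the event
        key, fn = node.arg
        return [{**e, key: fn(e)} for e in inputs[0]]
    if node.op == "FILTER":
        return [e for e in inputs[0] if node.arg(e)]
    if node.op == "JOIN":                 # nested-loop join on a pair predicate
        left, right = inputs
        return [(l, r) for l in left for r in right if node.arg(l, r)]
    if node.op == "AGGREGATE":
        return node.arg(inputs[0])
    raise ValueError(f"unknown operator: {node.op}")


# Keyword lookup standing in for the small LM used by EXTRACT (assumption).
CUISINE_KEYWORDS = {"pizza": "italian", "lasagna": "italian", "sushi": "japanese"}


def guess_cuisine(event: dict):
    text = event["text"].lower()
    for word, cuisine in CUISINE_KEYWORDS.items():
        if word in text:
            return cuisine
    return None


def shortly_after(first: dict, second: dict) -> bool:
    # Assumed reading of "after": within 6 hours following the first event.
    return timedelta(0) < second["time"] - first["time"] <= timedelta(hours=6)


EVENTS = [
    {"time": datetime(2024, 5, 4, 17, 0), "source": "calendar", "text": "Football with Sam"},
    {"time": datetime(2024, 5, 4, 20, 0), "source": "mail", "text": "Pizza at Luigi's was great"},
    {"time": datetime(2024, 5, 11, 17, 0), "source": "calendar", "text": "Football practice"},
    {"time": datetime(2024, 5, 11, 20, 30), "source": "mail", "text": "Sushi night downtown"},
    {"time": datetime(2024, 5, 12, 19, 0), "source": "mail", "text": "Had lasagna with Mom"},
]

tree = Node("AGGREGATE", len, [
    Node("JOIN", shortly_after, [
        Node("RETRIEVE", lambda e: "football" in e["text"].lower()),
        Node("FILTER", lambda e: e["cuisine"] == "italian", [
            Node("EXTRACT", ("cuisine", guess_cuisine), [
                Node("RETRIEVE", lambda e: e["source"] == "mail"),
            ]),
        ]),
    ]),
])

answer = execute(tree, EVENTS)
print(answer)  # 1: only the pizza dinner follows a football event within 6 hours
```

Because every intermediate result is a plain list of events, each step of the answer can be inspected, which is one way to read the 'traceable answer' requirement.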
System Modules
Decomposition Agent
Recursively breaks down questions into partial operator trees and sub-questions until leaf nodes are reached
Model or implementation: Distilled LLaMA (1B parameters) or similar small LM
RETRIEVE Operator (Execution)
Fetches candidate events with high recall and prunes irrelevant sources
Model or implementation: SPLADE (sparse retriever) + Cross-Encoder (classifier)
EXTRACT Operator (Execution)
Dynamically extracts values for non-existent keys (e.g., 'cuisine') from text fields (e.g., email body)
Model or implementation: BART (or similar small Seq2Seq LM)
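The cascade idea behind `RETRIEVE` (cheap high-recall scoring first, then pruning whole sources, then per-event classification) might look roughly like the sketch below. It is an assumption-laden toy: `sparse_score` is plain token overlap standing in for SPLADE, the two classifier callables stand in for cross-encoders, and the three-stage split and function names are hypothetical. The default of 0.1 mirrors the `retrieve_step_score_threshold` listed under Key Hyperparameters, though how the paper applies that value is not confirmed here.

```python
from collections import defaultdict
from typing import Callable


def sparse_score(query: str, text: str) -> float:
    """Fraction of query tokens found in the text; a crude stand-in for SPLADE."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(
    query: str,
    events: list[dict],
    source_is_relevant: Callable[[str, str], bool],   # stand-in for a cross-encoder
    event_is_relevant: Callable[[str, dict], bool],   # stand-in for a cross-encoder
    score_threshold: float = 0.1,
) -> list[dict]:
    # Stage 1: cheap, high-recall scoring over all events.
    candidates = [e for e in events if sparse_score(query, e["text"]) >= score_threshold]

    # Stage 2: decide once per source, so irrelevant sources are dropped wholesale.
    by_source: dict[str, list[dict]] = defaultdict(list)
    for e in candidates:
        by_source[e["source"]].append(e)
    kept = [e for src, group in by_source.items()
            if source_is_relevant(query, src) for e in group]

    # Stage 3: precise per-event check on what survives.
    return [e for e in kept if event_is_relevant(query, e)]


events = [
    {"source": "mail", "text": "italian dinner at luigi"},
    {"source": "music", "text": "italian opera playlist"},
    {"source": "workout", "text": "evening run 5km"},
    {"source": "mail", "text": "food delivery invoice for office"},
]

hits = retrieve(
    "italian food dinner",
    events,
    source_is_relevant=lambda q, src: src != "music",      # toy rule, not a model
    event_is_relevant=lambda q, e: "dinner" in e["text"],  # toy rule, not a model
)
```

In this toy run the workout event is dropped by the score threshold, the music event by the source-level decision, and the invoice email by the per-event check, leaving only the dinner email.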
Novel Architectural Elements
Recursive Decomposition Loop: The planner calls itself to refine sub-questions rather than generating the full plan in one shot
Hybrid Execution Tree: Leaf nodes are neural retrievers; intermediate nodes use neural extraction models to 'fill in' missing schema columns on the fly
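The recursive loop can be illustrated with a planner that emits only one operator and its sub-questions per call. This is a sketch under assumptions: the lookup table `STEP` replaces the distilled LM, and the operator labels, sub-question wording, and `max_depth` guard are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    op: str
    question: str
    children: list["PlanNode"] = field(default_factory=list)


# Hypothetical single-step planner output: question -> (operator, sub-questions).
# In the described system a small distilled LM would produce each step.
STEP = {
    "How often did I eat Italian food after playing football?":
        ("AGGREGATE:count", ["Italian meals after playing football"]),
    "Italian meals after playing football":
        ("JOIN:after", ["times I played football", "Italian meals"]),
    "Italian meals":
        ("FILTER:cuisine=italian", ["meals with cuisine"]),
    "meals with cuisine":
        ("EXTRACT:cuisine", ["meals"]),
}


def plan_step(question: str) -> tuple[str, list[str]]:
    """One planner call; questions it cannot split further become RETRIEVE leaves."""
    return STEP.get(question, ("RETRIEVE", []))


def decompose(question: str, depth: int = 0, max_depth: int = 8) -> PlanNode:
    """Call the planner on the question, then recurse on each sub-question."""
    if depth > max_depth:
        raise RuntimeError("decomposition did not terminate")
    op, sub_questions = plan_step(question)
    children = [decompose(q, depth + 1, max_depth) for q in sub_questions]
    return PlanNode(op, question, children)


def leaves(node: PlanNode) -> list[PlanNode]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


tree = decompose("How often did I eat Italian food after playing football?")
```

Each planner call only has to handle one operator and a simpler question, which is the property the summary credits for making small models viable.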
Modeling
Base Model: LLaMA (1B parameters) for the distilled QUD model
Training Method: Knowledge Distillation via Supervised Fine-Tuning (SFT)
Trainable Parameters: Full model parameters of the small student model
Training Data:
Generated via In-Context Learning (ICL) on a larger LLM
Filtered by executing generated trees on data and keeping only those yielding correct answers
Cross-encoder training data derived from ground-truth query events in PerQA
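The execution-based filter for the ICL-generated training data can be sketched as follows. The sample format, the two-field tree encoding, and `run_tree` are hypothetical simplifications; the only behavior taken from the description above is that a generated tree is kept only when executing it yields the correct answer.

```python
from typing import Any, Callable


def run_tree(tree: tuple[str, str], data: list[dict]) -> Any:
    """Toy executor for (operator, key) trees; stands in for real tree execution."""
    op, key = tree
    if op == "COUNT":
        return sum(1 for e in data if key in e)
    if op == "SUM":
        return sum(e[key] for e in data if key in e)
    raise ValueError(f"unsupported operator: {op}")


def filter_by_execution(samples: list[dict], executor: Callable) -> list[dict]:
    """Keep (question, tree) pairs whose tree executes and matches the gold answer."""
    kept = []
    for s in samples:
        try:
            predicted = executor(s["tree"], s["data"])
        except Exception:
            continue  # trees that fail to execute are discarded
        if predicted == s["gold_answer"]:
            kept.append({"question": s["question"], "tree": s["tree"]})
    return kept


data = [{"km": 5}, {"km": 7}, {"title": "Team meeting"}]
question = "How many km did I run?"
samples = [
    {"question": question, "tree": ("SUM", "km"), "data": data, "gold_answer": 12},
    {"question": question, "tree": ("COUNT", "km"), "data": data, "gold_answer": 12},
    {"question": question, "tree": ("AVG", "km"), "data": data, "gold_answer": 12},
]

training_pairs = filter_by_execution(samples, run_tree)
```

Here only the `SUM` tree survives: the `COUNT` tree returns 2 instead of 12, and the `AVG` tree fails to execute.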
Key Hyperparameters:
retrieve_step_score_threshold: 0.1
frozen_mapping_threshold: 70% consistency in first 50 inputs
Compute: Designed to run on end-user devices (mobile/tablet/PC)
Comparison to Prior Work
vs. TimelineQA: PerQA has 2,000+ question templates vs 42, and significantly more diverse/realistic persona data
vs. Verbalization: ReQAP handles data exceeding context windows (>100K tokens) and performs precise numerical aggregation
vs. Translation: ReQAP supports unstructured text via neural `EXTRACT` operators, whereas SQL fails on free text fields
vs. Binder [not cited in paper]: Binder also combines neural execution with SQL, but ReQAP focuses specifically on the recursive decomposition for personal data privacy and local execution
Limitations
Reliance on synthetic data (PerQA) due to privacy laws preventing release of real personal datasets
The approach assumes data can be modeled as events with key-value pairs (though the model is flexible)
Accuracy depends heavily on the quality of the distilled small models for `EXTRACT` and `RETRIEVE`
Code and data available at https://reqap.mpi-inf.mpg.de. The PerQA benchmark includes 3,500 complex questions and synthetic persona data. The paper uses public data sources (Wikidata, Endomondo) for synthesis.
📊 Experiments & Results
Evaluation Setup
QA over synthetic personal data archives (PerQA dataset)
Benchmarks:
PerQA (Complex Question Answering over Personal Data) [New]
Metrics:
Not explicitly reported in the paper text provided (likely Accuracy or Execution Match)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Specific numeric results tables were not included in the provided text snippet. The text claims 'substantial improvements' over baselines but does not list the specific accuracy/F1 numbers.
Main Takeaways
ReQAP effectively bridges the gap between Verbalization (good for text) and Translation (good for structure/aggregation)
Recursive decomposition allows small models (deployable on devices) to handle complex reasoning usually requiring giant LLMs
The `RETRIEVE` operator's pruning pipeline significantly improves efficiency by discarding entire irrelevant data sources (e.g., music logs for a food query)
PerQA provides a much-needed realistic benchmark for this domain, with far greater diversity than TimelineQA
📚 Prerequisite Knowledge
Prerequisites
Text-to-SQL / Semantic Parsing
Retrieval-Augmented Generation (RAG)
Knowledge Distillation
In-Context Learning (ICL)
Key Terms
QUD: Question Understanding and Decomposition—the stage where the natural language question is converted into an executable operator tree
OTX: Operator Tree Execution—the stage where the generated tree is run against the data to produce the answer
SPLADE: A sparse retrieval model that learns sparse weightings for tokens, used here for initial high-recall event retrieval
Cross-Encoder: A re-ranking model that processes query and document together for high precision; used here to classify relevant data patterns
Canonicalization: The process of converting raw data from various sources into a standardized key-value event format
ICL: In-Context Learning—providing examples in the prompt to guide the LLM's generation without weight updates
Verbalization: Converting structured data into natural language text so an LLM can process it
SFT: Supervised Fine-Tuning—training a model on a specific dataset to specialize its behavior