Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data

📝 Paper Summary

Memory recall Dense memory QA

ReQAP decomposes complex personal questions into recursive operator trees that combine SQL-like logic with neural extraction, enabling private, on-device answering over heterogeneous structured and unstructured data.

Core Problem

Personal data is heterogeneous (tables, text, logs) and massive (>100K tokens), making standard RAG fail on context limits and aggregations, while Text-to-SQL fails on unstructured text.

Why it matters:

Privacy requirements demand local processing on user devices, ruling out massive cloud-based LLMs for full context processing
Users need analytical answers (e.g., 'how often did I eat Italian after football?') which require joining structured logs (workouts) with unstructured text (social media) and precise aggregation
Current approaches force a tradeoff: verbalization (RAG) handles text but fails at aggregation; translation (Text-to-SQL) handles aggregation but fails on unstructured text

Concrete Example: For the question 'How often did I eat Italian food after playing football?', a standard SQL generator fails because 'Italian food' might only appear as 'pizza' in a text email body, while 'playing football' might be a calendar entry. ReQAP generates a tree that retrieves candidate events, extracts 'cuisine' from email text using a small LM, and then joins/counts the results.
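The tree described in this example can be written down as a small nested data structure. The sketch below is illustrative only: the `Node` class, the argument names, and the exact operator arguments are assumptions made for this example, not the paper's actual tree syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator in a hypothetical operator tree."""
    op: str                                   # e.g. "RETRIEVE", "EXTRACT", "JOIN"
    args: dict = field(default_factory=dict)  # operator arguments (assumed names)
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as a list of indented lines, root first."""
    lines = ["  " * depth + f"{node.op} {node.args}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Left branch: football events, e.g. from calendar entries.
football = Node("RETRIEVE", {"query": "I played football"})

# Right branch: meals whose cuisine has to be extracted from free text first.
italian_meals = Node("FILTER", {"cuisine": "Italian"}, [
    Node("EXTRACT", {"key": "cuisine"}, [
        Node("RETRIEVE", {"query": "I ate food"}),
    ]),
])

# Root: count the meal events that follow a football event.
tree = Node("AGGREGATE", {"func": "count"}, [
    Node("JOIN", {"condition": "meal starts after football"},
         [football, italian_meals]),
])

rendered = render(tree)
print("\n".join(rendered))
```

Reading the printed tree bottom-up mirrors the order in which such a tree would be executed: retrieval at the leaves, extraction and filtering in the middle, aggregation at the root.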

Key Novelty

Recursive Question Understanding and Decomposition (ReQAP)

Recursive Decomposition: Instead of generating a full query at once, the model recursively breaks a complex question into an operator and a simpler sub-question, refining the tree step-by-step
Hybrid Operators: Introduces `RETRIEVE` (high-recall retrieval with cascade pruning) and `EXTRACT` (using small LMs to dynamically populate virtual columns from text) to bridge structured and unstructured data
Distillation for On-Device Use: Uses In-Context Learning (ICL) on large models to generate training data, then distills this into small (1B-7B) local models that can execute the logic privately

Architecture

The two-stage process of ReQAP: (1) Question Understanding & Decomposition (QUD) generating the tree, and (2) Operator Tree Execution (OTX) processing the data.

Evaluation Highlights

PerQA benchmark: Constructed a new dataset with 3,500 complex questions and >40,000 events per persona to test analytical reasoning
ReQAP is reported to substantially outperform standard Text-to-SQL baselines on complex aggregation tasks involving unstructured text (the excerpt summarized here does not include the specific numbers)
Efficiency: The pruning pipeline in `RETRIEVE` enables scanning massive personal archives by eliminating irrelevant sources (e.g., music streams) early

Breakthrough Assessment

8/10

Strong contribution to privacy-preserving personal QA. The recursive decomposition and hybrid operator tree elegantly solve the structured/unstructured gap that plagues standard RAG and Text-to-SQL.

⚙️ Technical Details

Problem Definition

Setting: Question answering over a temporally ordered list of events, where each event is a dictionary of key-value pairs (some structured, some unstructured text)

Inputs: Natural language question q and a heterogeneous collection of user data streams (calendar, emails, workouts, etc.)

Outputs: A traceable answer derived from executing an operator tree
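As a rough illustration of the setting above, a personal archive can be held as a list of dictionaries sorted by time. The field names (`source`, `start`, `body`, and so on) and the sample values are assumptions for this sketch, not the schema used in the paper.

```python
from datetime import datetime

# Events from different sources, each a flat dictionary of key-value pairs.
# Some values are structured (numbers, timestamps), others are free text.
raw_events = [
    {"source": "mail", "start": datetime(2024, 5, 4, 19, 30),
     "subject": "Order confirmation",
     "body": "Thanks for ordering a pizza margherita."},
    {"source": "calendar", "start": datetime(2024, 5, 4, 16, 0),
     "title": "Football with Tom", "duration_min": 90},
    {"source": "workout", "start": datetime(2024, 5, 2, 7, 15),
     "type": "running", "distance_km": 5.2},
]

# The setting assumes a temporally ordered list of events.
timeline = sorted(raw_events, key=lambda event: event["start"])

# No event has a "cuisine" key, so a question about Italian food cannot be
# answered by a plain lookup over the existing keys.
has_cuisine = any("cuisine" in event for event in timeline)

print([event["source"] for event in timeline], has_cuisine)
```

The missing `cuisine` key is the kind of gap that the `EXTRACT` operator described later is meant to fill from text fields such as the mail body.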

Pipeline Flow

Group 1 (Planning): Recursive Question Decomposition (LLM) → Operator Tree
Group 2 (Execution): Tree Traversal → Leaf Nodes (RETRIEVE) → Intermediate Nodes (EXTRACT/FILTER/JOIN) → Root (AGGREGATE) → Answer
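The execution group can be sketched as a bottom-up traversal in which each node first evaluates its children and then applies its own operator. Everything below is a toy stand-in: retrieval is a keyword match, extraction is a lookup table instead of a small LM, and the join condition (same day, meal later than football) is an assumption chosen for this example.

```python
from datetime import datetime

EVENTS = [
    {"source": "calendar", "start": datetime(2024, 5, 4, 16), "title": "Football match"},
    {"source": "mail", "start": datetime(2024, 5, 4, 19), "body": "Your pizza order is confirmed"},
    {"source": "calendar", "start": datetime(2024, 5, 11, 16), "title": "Football match"},
    {"source": "mail", "start": datetime(2024, 5, 11, 20), "body": "Your sushi order is confirmed"},
]
CUISINES = {"pizza": "Italian", "sushi": "Japanese"}  # toy replacement for an LM


def text_of(event):
    """Lower-cased concatenation of all values of an event."""
    return " ".join(str(value) for value in event.values()).lower()


def execute(node):
    """Evaluate children first, then apply this node's operator."""
    inputs = [execute(child) for child in node.get("children", [])]
    op = node["op"]
    if op == "RETRIEVE":   # leaf: fetch candidate events
        return [e for e in EVENTS if node["keyword"] in text_of(e)]
    if op == "EXTRACT":    # add a virtual column derived from text
        return [{**e, node["key"]: next(
            (c for word, c in CUISINES.items() if word in text_of(e)), None)}
            for e in inputs[0]]
    if op == "FILTER":
        return [e for e in inputs[0] if e.get(node["key"]) == node["value"]]
    if op == "JOIN":       # assumed condition: same day, right side is later
        left, right = inputs
        return [(l, r) for l in left for r in right
                if l["start"].date() == r["start"].date() and r["start"] > l["start"]]
    if op == "AGGREGATE":  # root: count the joined pairs
        return len(inputs[0])
    raise ValueError(f"unknown operator: {op}")


tree = {"op": "AGGREGATE", "children": [{"op": "JOIN", "children": [
    {"op": "RETRIEVE", "keyword": "football"},
    {"op": "FILTER", "key": "cuisine", "value": "Italian", "children": [
        {"op": "EXTRACT", "key": "cuisine", "children": [
            {"op": "RETRIEVE", "keyword": "order"}]}]},
]}]}

answer = execute(tree)
print(answer)
```

Because every intermediate result is an ordinary list of events, the final count can be traced back to the concrete events that produced it.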

System Modules

Decomposition Agent

Recursively breaks down questions into partial operator trees and sub-questions until leaf nodes are reached

Model or implementation: Distilled LLaMA (1B parameters) or similar small LM

RETRIEVE Operator (Execution)

Fetches candidate events with high recall and prunes irrelevant sources

Model or implementation: SPLADE (sparse retriever) + Cross-Encoder (classifier)
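A minimal sketch of such a retrieve-then-prune cascade is shown below. The two scoring functions are toy stand-ins (a token-overlap score in place of SPLADE, a hard-coded function in place of a trained cross-encoder), and applying the 0.1 threshold to the classifier score, as well as judging a source by its first candidate, are assumptions of this example.

```python
THRESHOLD = 0.1  # value of retrieve_step_score_threshold listed in this summary


def overlap_score(query, event):
    """Toy token-overlap score standing in for a sparse retriever."""
    query_tokens = set(query.lower().split())
    event_tokens = set(" ".join(str(v) for v in event.values()).lower().split())
    return len(query_tokens & event_tokens) / len(query_tokens)


def toy_classifier(query, event):
    """Hypothetical relevance classifier; a real system would use a trained model."""
    return 0.02 if event["source"] == "music" else 0.8


def retrieve(query, events, first_stage, classifier, threshold=THRESHOLD):
    # Stage 1: cheap, high-recall candidate selection.
    candidates = [e for e in events if first_stage(query, e) > 0]

    # Group candidates by source so a whole source can be pruned at once.
    by_source = {}
    for event in candidates:
        by_source.setdefault(event["source"], []).append(event)

    # Stage 2: judge each source by one sample, then score surviving events.
    kept = []
    for source, group in by_source.items():
        if classifier(query, group[0]) < threshold:
            continue  # the whole source is discarded early
        kept.extend(e for e in group if classifier(query, e) >= threshold)
    return kept


events = [
    {"source": "mail", "text": "Dinner order pizza margherita"},
    {"source": "music", "text": "Dinner Bell by Example Band"},
    {"source": "calendar", "text": "Dinner with Anna"},
    {"source": "workout", "text": "Morning run 5 km"},
]
results = retrieve("dinner", events, overlap_score, toy_classifier)
print([e["source"] for e in results])
```

The music stream matches the query lexically but is dropped as a source before its events are scored individually, which is where the efficiency gain of pruning comes from.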

EXTRACT Operator (Execution)

Dynamically extracts values for non-existent keys (e.g., 'cuisine') from text fields (e.g., email body)

Model or implementation: BART (or similar small Seq2Seq LM)
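The idea of a virtual column can be sketched as follows: if the requested key already exists it is reused, otherwise a model is asked to produce the value from the event's text fields. The `toy_model` function and the prompt format are placeholders invented for this sketch; in the described system a small sequence-to-sequence LM would take that role.

```python
def toy_model(prompt):
    """Placeholder for a small seq2seq LM; maps a prompt to an extracted value."""
    lowered = prompt.lower()
    if "pizza" in lowered or "pasta" in lowered:
        return "Italian"
    return None


def extract(events, key, model):
    """Return copies of the events with `key` filled in as a virtual column."""
    enriched = []
    for event in events:
        if key in event:
            value = event[key]  # structured value already present
        else:
            text = " ".join(v for v in event.values() if isinstance(v, str))
            value = model(f"{key}: {text}")  # hypothetical prompt format
        enriched.append({**event, key: value})
    return enriched


events = [
    {"source": "mail", "body": "Your pizza is on its way"},
    {"source": "restaurant_log", "name": "Baan", "cuisine": "Thai"},
]
rows = extract(events, "cuisine", toy_model)
print([row["cuisine"] for row in rows])
```

The input events are left unchanged, so the same retrieved events can feed several `EXTRACT` calls with different keys.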

Novel Architectural Elements

Recursive Decomposition Loop: The planner calls itself to refine sub-questions rather than generating the full plan in one shot
Hybrid Execution Tree: Leaf nodes are neural retrievers; intermediate nodes use neural extraction models to 'fill in' missing schema columns on the fly
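The recursive loop can be illustrated with a planner that, for each question, returns only one operator plus the simpler sub-questions it still needs answered. In the sketch below the planner is a hand-written lookup table standing in for an LLM call, and the operator strings and sub-question wording are invented for illustration.

```python
# Hypothetical single-step planner output: question -> (operator, sub-questions).
TOY_PLANNER = {
    "How often did I eat Italian food after playing football?":
        ("AGGREGATE(count)", ["Which Italian meals did I eat after playing football?"]),
    "Which Italian meals did I eat after playing football?":
        ("JOIN(meal after football)",
         ["When did I play football?", "Which Italian meals did I eat?"]),
    "When did I play football?": ("RETRIEVE(football)", []),
    "Which Italian meals did I eat?":
        ("FILTER(cuisine=Italian)", ["Which cuisines did my meals have?"]),
    "Which cuisines did my meals have?":
        ("EXTRACT(cuisine)", ["Which meals did I eat?"]),
    "Which meals did I eat?": ("RETRIEVE(meals)", []),
}


def decompose(question, plan_step, max_depth=8):
    """Build a tree by asking the planner for one step at a time."""
    if max_depth == 0:
        raise RecursionError("decomposition did not reach leaf operators")
    operator, sub_questions = plan_step(question)
    return {
        "op": operator,
        # The planner is applied again to every sub-question.
        "children": [decompose(q, plan_step, max_depth - 1) for q in sub_questions],
    }


def depth(tree):
    """Number of operator levels in the tree."""
    return 1 + max((depth(child) for child in tree["children"]), default=0)


root = "How often did I eat Italian food after playing football?"
tree = decompose(root, TOY_PLANNER.__getitem__)
print(tree["op"], depth(tree))
```

Each planner call only has to solve a small local step, which is one plausible reason a small model can cope with questions that would be hard to translate in a single shot.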

Modeling

Base Model: LLaMA (1B parameters) for the distilled QUD model

Training Method: Knowledge Distillation via Supervised Fine-Tuning (SFT)

Trainable Parameters: Full model parameters of the small student model

Training Data:

Generated via In-Context Learning (ICL) on a larger LLM
Filtered by executing generated trees on data and keeping only those yielding correct answers
Cross-encoder training data derived from ground-truth query events in PerQA
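The execution-based filtering step can be sketched as a loop that runs each generated tree and keeps it only when the result equals the known answer. The sample format, the one-line tree notation, and the `toy_execute` function are assumptions for this example; discarding trees that raise an error is likewise an assumption.

```python
def filter_by_execution(samples, execute):
    """Keep (question, tree) pairs whose execution yields the gold answer."""
    kept = []
    for sample in samples:
        try:
            predicted = execute(sample["tree"])
        except Exception:
            continue  # malformed trees are dropped (assumed behaviour)
        if predicted == sample["gold_answer"]:
            kept.append({"question": sample["question"], "tree": sample["tree"]})
    return kept


# Toy executor: a "tree" is the string "count:<topic>" looked up in fixed data.
TOY_COUNTS = {"football": 12, "pizza": 7}


def toy_execute(tree):
    operator, topic = tree.split(":")
    if operator != "count":
        raise ValueError(f"unsupported operator: {operator}")
    return TOY_COUNTS[topic]


generated = [
    {"question": "How often did I play football?", "tree": "count:football", "gold_answer": 12},
    {"question": "How often did I eat pizza?", "tree": "count:football", "gold_answer": 7},
    {"question": "How often did I swim?", "tree": "count:swimming", "gold_answer": 3},
]
training_set = filter_by_execution(generated, toy_execute)
print(len(training_set))
```

Only the first sample survives: the second executes but gives the wrong answer, and the third fails during execution.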

Key Hyperparameters:

retrieve_step_score_threshold: 0.1
frozen_mapping_threshold: 70% consistency in first 50 inputs
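The summary does not spell out how `frozen_mapping_threshold` is applied. One plausible reading, sketched below purely as an assumption, is that an extraction step watches its first 50 inputs and, if the model output equals the value of one existing key in at least 70% of them, copies from that key afterwards instead of calling the model. The function and parameter names are hypothetical.

```python
from collections import Counter


def extract_with_freezing(events, key, model, warmup=50, threshold=0.7):
    """Fill `key` via `model`, switching to a plain key copy once it is consistent."""
    agreement = Counter()  # existing key -> times its value equalled the model output
    frozen_key = None      # key to copy from once the mapping is frozen
    model_calls = 0
    enriched = []
    for index, event in enumerate(events):
        if frozen_key is not None and frozen_key in event:
            value = event[frozen_key]  # cheap copy, no model call
        else:
            value = model(event)
            model_calls += 1
            if index < warmup:
                for existing_key, existing_value in event.items():
                    if existing_value == value:
                        agreement[existing_key] += 1
        enriched.append({**event, key: value})

        # After the warm-up window, freeze the mapping if it was consistent enough.
        if index + 1 == warmup and agreement:
            best_key, hits = agreement.most_common(1)[0]
            if hits / warmup >= threshold:
                frozen_key = best_key
    return enriched, model_calls


events = [{"type": "run" if i % 2 else "swim", "note": f"session {i}"} for i in range(60)]
rows, calls = extract_with_freezing(events, "sport", lambda event: event["type"])
print(calls, rows[-1]["sport"])
```

Under this reading the threshold trades a small risk of wrong copies for far fewer model calls on large archives.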

Compute: Designed to run on end-user devices (mobile/tablet/PC)

Comparison to Prior Work

vs. TimelineQA: PerQA has 2,000+ question templates vs 42, and significantly more diverse/realistic persona data
vs. Verbalization: ReQAP handles data exceeding context windows (>100K tokens) and performs precise numerical aggregation
vs. Translation: ReQAP supports unstructured text via neural `EXTRACT` operators, whereas SQL fails on free text fields
vs. Binder [not cited in paper]: Binder also combines neural execution with SQL, but ReQAP focuses specifically on the recursive decomposition for personal data privacy and local execution

Limitations

Reliance on synthetic data (PerQA) due to privacy laws preventing release of real personal datasets
The approach assumes data can be modeled as events with key-value pairs (though the model is flexible)
Accuracy depends heavily on the quality of the distilled small models for `EXTRACT` and `RETRIEVE`

Reproducibility

Code: https://reqap.mpi-inf.mpg.de

Code and data available at https://reqap.mpi-inf.mpg.de. The PerQA benchmark includes 3,500 complex questions and synthetic persona data. The paper uses public data sources (Wikidata, Endomondo) for synthesis.

📊 Experiments & Results

Evaluation Setup

QA over synthetic personal data archives (PerQA dataset)

Benchmarks:

PerQA (Complex Question Answering over Personal Data) [New]

Metrics:

Not explicitly reported in the paper text provided (likely Accuracy or Execution Match)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Specific numeric results tables were not included in the provided text snippet. The text claims 'substantial improvements' over baselines but does not list the specific accuracy/F1 numbers.

Main Takeaways

ReQAP effectively bridges the gap between Verbalization (good for text) and Translation (good for structure/aggregation)
Recursive decomposition allows small models (deployable on devices) to handle complex reasoning usually requiring giant LLMs
The `RETRIEVE` operator's pruning pipeline significantly improves efficiency by discarding entire irrelevant data sources (e.g., music logs for a food query)
PerQA provides a much-needed realistic benchmark for this domain, with far more question templates (2,000+ vs. 42) and more diverse persona data than TimelineQA

📚 Prerequisite Knowledge

Prerequisites

Text-to-SQL / Semantic Parsing
Retrieval-Augmented Generation (RAG)
Knowledge Distillation
In-Context Learning (ICL)

Key Terms

QUD: Question Understanding and Decomposition—the stage where the natural language question is converted into an executable operator tree

OTX: Operator Tree Execution—the stage where the generated tree is run against the data to produce the answer

SPLADE: A sparse retrieval model that learns sparse weightings for tokens, used here for initial high-recall event retrieval

Cross-Encoder: A re-ranking model that processes query and document together for high precision; used here to classify relevant data patterns

Canonicalization: The process of converting raw data from various sources into a standardized key-value event format

ICL: In-Context Learning—providing examples in the prompt to guide the LLM's generation without weight updates

Verbalization: Converting structured data into natural language text so an LLM can process it

SFT: Supervised Fine-Tuning—training a model on a specific dataset to specialize its behavior