AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries

📝 Paper Summary

Agentic RAG pipeline

AOP is a framework that automates the orchestration and optimization of LLM pipelines for complex queries by assembling predefined semantic operators into interactive, self-reflecting execution workflows.

Core Problem

Current data lakes and static LLM pipelines (like basic RAG) struggle with complex queries requiring multi-hop retrieval, logical reasoning, and analytics across heterogeneous data because they lack dynamic planning and interactive adjustment.

Why it matters:

Manual pipeline orchestration is brittle, costly, and requires significant human expertise to design effective workflows
Static pipelines cannot adapt to intermediate failures (e.g., retrieving irrelevant documents), leading to error propagation and incorrect final answers
Existing systems fail to effectively link and analyze heterogeneous data types (structured tables, unstructured documents) simultaneously for complex reasoning

Concrete Example: For the query 'Who are the members of the men's team table tennis champion team at the 2024 Olympic Games?', a standard RAG pipeline may fail because the query requires multi-hop retrieval: first finding the winning team from news (unstructured), then looking up that team's roster in a table (structured). AOP handles this by linking the 'champion team' concept to a specific roster lookup.
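
The two hops in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is placeholder text (not real Olympic results), and find_champion and lookup_roster are hypothetical helpers, not operators taken from the paper.

```python
# Toy placeholder data: NOT real Olympic results.
NEWS = [
    "Team A won the men's team table tennis final at the 2024 Olympic Games.",
    "Team B took silver in the men's team table tennis event.",
]
ROSTERS = [
    {"team": "Team A", "player": "Player 1"},
    {"team": "Team A", "player": "Player 2"},
    {"team": "Team B", "player": "Player 3"},
]


def find_champion(docs):
    """Hop 1 (unstructured): return the team described as the winner."""
    for doc in docs:
        if " won " in doc:
            return doc.split(" won ")[0]
    return None


def lookup_roster(rows, team):
    """Hop 2 (structured): list the players whose row matches the team."""
    return [row["player"] for row in rows if row["team"] == team]


champion = find_champion(NEWS)
members = lookup_roster(ROSTERS, champion) if champion else []
print(champion, members)
```

A single retrieval step over NEWS alone would never see the roster rows, which is why the second hop depends on the output of the first.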

Key Novelty

Automated Semantic Operator Orchestration

Decomposes complex queries into chains of standard 'semantic operators' (e.g., Semantic Filter, Retrieve, Aggregate) rather than bespoke code
Uses an LLM-based planner to generate initial operator chains, rewrites them into Directed Acyclic Graphs (DAGs) for parallelism, and optimizes them using a cost model
Executes pipelines interactively, allowing the system to inspect intermediate results and dynamically adjust the plan (e.g., pruning paths if retrieval fails)

Architecture

The AOP architecture, illustrating the flow from Query Interface to Planner, Optimizer, Executor, and finally the Answer.

Evaluation Highlights

+45% accuracy improvement on a challenging subset of the CRAG benchmark compared to directly asking the LLM
Reduces execution latency by utilizing parallel execution of independent operators within DAG-structured pipelines
Demonstrates effective handling of heterogeneous data by linking structured tables and unstructured text via semantic operators

Breakthrough Assessment

8/10

Significantly advances Agentic RAG by formalizing 'semantic operators' similar to database algebra, enabling systematic optimization and interactive execution for complex queries over heterogeneous data.

⚙️ Technical Details

Problem Definition

Setting: Complex query answering over heterogeneous data lakes (structured, semi-structured, and unstructured data)

Inputs: Natural language query

Outputs: Final answer (text or structured data) derived from multi-step reasoning

Pipeline Flow

Query Interface (receives NL query)
Planner/Optimizer (generates operator chains → rewrites to DAGs → selects best plan via cost model)
Executor (runs operators interactively, adjusting based on results)
Context Manager (summarizes intermediate state)

System Modules

Planner (Orchestration)

Orchestrates execution pipelines by selecting appropriate semantic operators for the input query

Model or implementation: LLM-based (Specific model not reported in the paper)

Pipeline Rewriter (Orchestration)

Refines chain pipelines into DAG structures to enable parallel execution

Model or implementation: LLM-based
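
To make the chain-to-DAG idea concrete, here is a minimal rule-based sketch. The paper's rewriter is LLM-based; this stand-in instead infers edges from declared data dependencies. The Step class, its reads/writes fields, and the step names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    reads: frozenset  # variables this step consumes (hypothetical field)
    writes: str       # variable this step produces (hypothetical field)


def chain_to_dag(chain):
    """Map each step name to the set of earlier steps it depends on."""
    last_writer, deps = {}, {}
    for step in chain:
        deps[step.name] = {last_writer[v] for v in step.reads if v in last_writer}
        last_writer[step.writes] = step.name
    return deps


def layers(deps):
    """Group steps into layers; steps within a layer are independent."""
    remaining, done, result = dict(deps), set(), []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cycle detected")
        result.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return result


chain = [
    Step("retrieve_news", frozenset({"query"}), "news"),
    Step("scan_table", frozenset({"query"}), "rows"),
    Step("filter_news", frozenset({"news"}), "champion"),
    Step("join", frozenset({"champion", "rows"}), "answer"),
]
dag = chain_to_dag(chain)
print(layers(dag))
```

In this toy chain, retrieve_news and scan_table share no data dependency, so they land in the same layer and could run in parallel.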

Cost-Based Optimizer

Selects the most efficient pipeline by estimating computational costs and data cardinality

Model or implementation: Mathematical cost model fitted on sample workloads
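
The paper's cost formula and fitted parameters are not given in this summary, so the following is only a sketch of how cost and cardinality estimates could combine to rank candidate plans. The per-item costs and selectivities are made-up numbers, and OpEstimate and plan_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OpEstimate:
    name: str
    cost_per_item: float  # assumed fitted cost per input item
    selectivity: float    # assumed fraction of items passed downstream


def plan_cost(ops, input_cardinality):
    """Sum per-operator cost, shrinking the cardinality after each operator."""
    total, card = 0.0, float(input_cardinality)
    for op in ops:
        total += op.cost_per_item * card
        card *= op.selectivity
    return total


keyword = OpEstimate("keyword_filter", cost_per_item=0.01, selectivity=0.1)
llm = OpEstimate("llm_filter", cost_per_item=1.0, selectivity=0.5)

plans = {"cheap_first": [keyword, llm], "llm_first": [llm, keyword]}
best = min(plans, key=lambda name: plan_cost(plans[name], 1000))
print(best, {name: plan_cost(ops, 1000) for name, ops in plans.items()})
```

Under these assumed numbers, running the cheap, selective filter first means the expensive LLM operator sees far fewer items, which is the kind of ordering decision a cost model is meant to surface.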

Pipeline Executor

Executes the pipeline layer-by-layer with interactive adjustment

Model or implementation: Hybrid (LLM calls + Pre-programmed functions)
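
A minimal sketch of layer-by-layer execution with pruning follows. It assumes the pipeline is already split into layers and that an empty result counts as a failure; run_layers and the toy operators are hypothetical, and a real executor would also involve LLM self-reflection, which is omitted here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(layers, ops, deps):
    """Run each layer in parallel, then prune steps downstream of failures."""
    results, pruned = {}, set()
    with ThreadPoolExecutor() as pool:
        for layer in layers:
            # Skip any step whose inputs come from a pruned step.
            runnable = [n for n in layer if not (deps[n] & pruned)]
            pruned.update(set(layer) - set(runnable))
            futures = {
                n: pool.submit(ops[n], {d: results[d] for d in deps[n]})
                for n in runnable
            }
            # Inspect intermediate results before moving to the next layer.
            for name, future in futures.items():
                out = future.result()
                if out:
                    results[name] = out
                else:
                    pruned.add(name)
    return results, pruned


ops = {
    "retrieve_news": lambda inp: [],  # simulated retrieval failure
    "scan_table": lambda inp: ["row1", "row2"],
    "filter_news": lambda inp: inp["retrieve_news"][:1],
    "count_rows": lambda inp: [len(inp["scan_table"])],
}
deps = {
    "retrieve_news": set(),
    "scan_table": set(),
    "filter_news": {"retrieve_news"},
    "count_rows": {"scan_table"},
}
results, pruned = run_layers(
    [["retrieve_news", "scan_table"], ["filter_news", "count_rows"]], ops, deps
)
print(results, pruned)
```

Because the check happens between layers, the failed retrieval never feeds filter_news, so the error does not propagate into later steps.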

Context Manager

Condenses intermediate information to fit within LLM context limits

Model or implementation: LLM-based (Summarize/Explain operators)
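
As a rough illustration of condensing intermediate state, the sketch below trims each step's output to fit a shared budget. It assumes word counts as a stand-in for tokens and uses plain truncation in place of an LLM Summarize call; summarize and build_context are hypothetical names.

```python
def summarize(text, max_words):
    """Placeholder for an LLM Summarize call: keep the first max_words words."""
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words])


def build_context(intermediate, budget_words):
    """Condense step outputs so their combined length fits the budget."""
    total = sum(len(text.split()) for text in intermediate.values())
    if total <= budget_words:
        return dict(intermediate)
    # Assumption: split the budget evenly across steps.
    per_step = max(1, budget_words // len(intermediate))
    return {name: summarize(text, per_step) for name, text in intermediate.items()}


state = {"retrieve": "a b c d e f g h", "filter": "x y"}
context = build_context(state, budget_words=6)
print(context)
```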

Novel Architectural Elements

Use of 22 predefined 'Semantic Operators' (Retrieve, Scan, Filter, etc.) as standardized building blocks for LLM agents
Layer-wise interactive execution engine that pauses to self-reflect and prune/adjust the pipeline graph at runtime
Hybrid physical implementation of operators: some are LLM-based prompts, others are pre-programmed functions (e.g., BM25 search)
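
The hybrid implementation idea can be sketched as a registry in which some operators are ordinary functions and others wrap a model call. Everything here is an assumption for illustration: keyword_retrieve is a simple term-overlap scorer standing in for BM25, and fake_llm is an offline stub rather than a real model.

```python
from typing import Callable, List


def keyword_retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Pre-programmed operator: rank docs by shared terms (stand-in for BM25)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in docs]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:k]]


def make_llm_filter(llm: Callable[[str], str]):
    """LLM-based operator: keep docs for which the model answers 'yes'."""
    def semantic_filter(condition: str, docs: List[str]) -> List[str]:
        prompt = "Does this text satisfy '{}'? {}"
        return [
            d for d in docs
            if llm(prompt.format(condition, d)).strip().lower() == "yes"
        ]
    return semantic_filter


def fake_llm(prompt: str) -> str:
    """Hypothetical stub so the sketch runs offline; not a real model call."""
    return "yes" if "champion" in prompt.split("?", 1)[1].lower() else "no"


OPERATORS = {"Retrieve": keyword_retrieve, "Filter": make_llm_filter(fake_llm)}

docs = [
    "Team A is the champion of the final",
    "The weather at the final was sunny",
    "Ticket prices rose",
]
hits = OPERATORS["Retrieve"]("final champion", docs)
kept = OPERATORS["Filter"]("mentions a winner", hits)
print(hits, kept)
```

Both kinds of operator expose the same call shape, so a planner could mix them in one pipeline without caring which are backed by a model.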

Modeling

Base Model: Not reported in the paper

Training Method: Process Reward Model (PRM) training to fine-tune LLM planning

Training Data:

Recorded sequences of chosen pipelines, intermediate results, and final outcomes serve as training data
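
One plausible shape for such records is sketched below. The field names are hypothetical, and labeling every step prefix with the final outcome is an assumption; the summary does not say how the paper derives step-level rewards.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepRecord:
    operator: str
    intermediate_result: str


@dataclass
class Trajectory:
    query: str
    steps: List[StepRecord] = field(default_factory=list)
    final_correct: bool = False


def to_step_examples(traj):
    """Emit one example per step prefix, labeled with the final outcome."""
    reward = 1.0 if traj.final_correct else 0.0  # assumed labeling scheme
    return [
        {
            "query": traj.query,
            "prefix": [s.operator for s in traj.steps[:i]],
            "reward": reward,
        }
        for i in range(1, len(traj.steps) + 1)
    ]


traj = Trajectory(
    query="toy query",
    steps=[StepRecord("Retrieve", "3 docs"), StepRecord("Filter", "1 doc")],
    final_correct=True,
)
examples = to_step_examples(traj)
print(examples)
```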

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAG: AOP uses multi-step dynamic planning with logic operators (Filter, Aggregate) rather than a fixed retrieval step
vs. NL2SQL: AOP supports unstructured and semi-structured data via semantic operators, not just structured tables
vs. LangChain/AutoGPT [not cited in paper]: AOP introduces database-style optimization (cost models, cardinality estimation) and DAG parallelism to agentic workflows

Limitations

Cardinality estimation for unstructured data is challenging and relies on sampling, which may be inaccurate
Cost model parameters require fitting on sample workloads, which may not generalize to all query types
The paper does not specify the base LLM used for experiments, making exact reproduction difficult
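
The first limitation is easy to see in a small sketch of sampling-based cardinality estimation. This is a generic estimator, not the paper's method; estimate_cardinality and its parameters are hypothetical.

```python
import random


def estimate_cardinality(items, predicate, sample_size, seed=0):
    """Estimate how many items satisfy predicate from a random sample."""
    if not items:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    hits = sum(1 for x in sample if predicate(x))
    return len(items) * hits / len(sample)


items = list(range(1000))


def is_match(x):
    # Stand-in for an expensive semantic predicate; true for 100 of 1000 items.
    return x % 10 == 0


small = estimate_cardinality(items, is_match, sample_size=20)
exact = estimate_cardinality(items, is_match, sample_size=1000)
print(small, exact)
```

With a sample of 20, each hit shifts the estimate by 50 items, so small samples can land far from the true count of 100. When the predicate is itself an LLM call, larger samples also cost more, which is the trade-off behind this limitation.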

Reproducibility

Code availability is not provided. The paper lists the 22 semantic operators and their descriptions, but specific prompts, cost model parameters, and trained model weights are not released.


📊 Experiments & Results

Evaluation Setup

Open-domain Question Answering on the CRAG dataset

Benchmarks:

CRAG (Comprehensive RAG Benchmark): complex QA (multi-hop, aggregation, comparison)

Metrics:

Accuracy (Rule-based matching and GPT-4 assessment)
End-to-end latency
Token consumption
Statistical methodology: Not explicitly reported in the paper
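
The exact matching rules are not described in this summary. As one common way to implement rule-based accuracy, the sketch below normalizes answers (lowercasing, stripping punctuation and articles) before exact comparison; that normalization is an assumption, and the GPT-4 assessment is not modeled.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rule_based_accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalizing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(predictions)


score = rule_based_accuracy(["The Knicks.", "42"], ["knicks", "41"])
print(score)  # 0.5
```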

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
CRAG (challenging subset)	Accuracy	Not reported (direct LLM prompting)	Not reported	+45% over direct LLM prompting

Experiment Figures

A concrete example of pipeline orchestration for the query 'What is the average height of New York Knicks players that went to college at Villanova?'
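
The figure itself is not reproduced here, so the following is an assumed decomposition of that query into Scan, Filter, and Aggregate steps. The rows are placeholders (not real roster data), and the function names are hypothetical.

```python
PLAYERS = [  # placeholder rows, NOT real roster data
    {"name": "Player A", "team": "New York Knicks", "college": "Villanova", "height_cm": 190},
    {"name": "Player B", "team": "New York Knicks", "college": "Villanova", "height_cm": 200},
    {"name": "Player C", "team": "New York Knicks", "college": "Duke", "height_cm": 210},
    {"name": "Player D", "team": "Boston Celtics", "college": "Villanova", "height_cm": 185},
]


def scan(table):
    """Scan: read every row of the table."""
    return list(table)


def filter_eq(rows, column, value):
    """Filter: keep rows where column equals value."""
    return [r for r in rows if r[column] == value]


def aggregate_avg(rows, column):
    """Aggregate: average a numeric column, or None if no rows remain."""
    return sum(r[column] for r in rows) / len(rows) if rows else None


rows = scan(PLAYERS)
rows = filter_eq(rows, "team", "New York Knicks")
rows = filter_eq(rows, "college", "Villanova")
avg_height = aggregate_avg(rows, "height_cm")
print(avg_height)  # 195.0
```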

Main Takeaways

AOP significantly improves accuracy on complex queries (multi-hop, aggregation) by decomposing them into executable semantic operators
Parallel execution via DAGs and prefetching strategies effectively reduces the latency overhead introduced by multi-step planning
Interactive execution allows the system to recover from retrieval failures by dynamically adjusting the pipeline plan, avoiding error propagation

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Database operator concepts (Select, Project, Join, Aggregate)
Chain-of-Thought (CoT) reasoning

Key Terms

Semantic Operators: Predefined functional units (e.g., Retrieve, Filter, Summarize) used as building blocks for LLM pipelines, similar to SQL operators but for semantic tasks

DAG: Directed Acyclic Graph, a structure used here to represent execution pipelines where independent operators can run in parallel

Prefetching: A technique where the system uses idle time to proactively retrieve information that later steps may need, so that recovering from a retrieval failure adds less latency
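
A minimal sketch of the idea, assuming a backup query is known up front: the backup retrieval starts in the background while the primary one runs, so its result is already available if the primary comes back empty. The retrieve and run_with_prefetch functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve(query, corpus):
    """Toy retrieval: substring match over a list of documents."""
    return [doc for doc in corpus if query.lower() in doc.lower()]


def run_with_prefetch(primary_query, backup_query, corpus):
    """Start the backup retrieval early; use it only if the primary fails."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        backup = pool.submit(retrieve, backup_query, corpus)  # prefetch
        primary = retrieve(primary_query, corpus)
        if primary:
            return primary, "primary"
        return backup.result(), "prefetched"


corpus = ["Team A roster: Player 1, Player 2"]
docs, source = run_with_prefetch("champion squad", "roster", corpus)
print(docs, source)
```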

Process Reward Model (PRM): A model that assigns rewards to intermediate steps in a reasoning process, used to fine-tune the LLM's planning capabilities

Schema Linking: The process of identifying and connecting relevant data elements (like table columns or document sections) to the terms in a natural language query
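
As a simple illustration, schema linking can be approximated by lexical overlap between query terms and column names. This is an assumption for the sketch only; practical systems typically rely on embeddings or an LLM, and link_schema and the toy schema are hypothetical.

```python
def link_schema(query, schema):
    """Map 'table.column' to the query terms that match the column name."""
    q_terms = set(query.lower().replace("?", "").split())
    links = {}
    for table, columns in schema.items():
        for col in columns:
            matched = q_terms & set(col.lower().split("_"))
            if matched:
                links[f"{table}.{col}"] = sorted(matched)
    return links


schema = {"players": ["name", "height", "college"], "games": ["date", "score"]}
links = link_schema("What is the average height of players by college?", schema)
print(links)
```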