Full Automation of Goal-driven LLM Dialog Threads with And-Or Recursors and Refiner Oracles

📝 Paper Summary

Agentic RAG pipeline Self-evolving Agentic reasoning

The paper automates deep reasoning by steering LLMs through a recursive descent algorithm that builds And-Or trees of subtasks and alternatives, validating hypotheses against ground-truth embeddings or oracle advice.

Core Problem

Steering LLM dialogs toward complex goals requires labor-intensive prompt engineering, and simple one-shot queries often fail to maintain focus or depth.

Why it matters:

Without extensive manual intervention, users struggle to keep LLMs focused on a task while digging deep into its details
Standard LLM interactions lack a structured memory of the reasoning path, leading to drift and hallucinations in multi-step tasks
Existing logic programming approaches are too rigid for natural language, while LLMs lack the inherent structure to perform rigorous step-by-step logic independently

Concrete Example: When asked to explain a complex causal chain, a standard LLM might provide a surface-level summary or drift into irrelevant topics. In contrast, this system forces a recursive breakdown: first generating alternative causes (OR-step), then breaking each into necessary conditions (AND-step), creating a verifiable trace of justification.

Key Novelty

Logic-Guided Recursive Descent for LLM Dialogs

Adapts the SLD-resolution algorithm from Horn Clause logic to natural language, replacing unification with LLM-generated clause heads and bodies
Maintains an explicit goal stack and context history to steer the LLM, treating the conversation as a proof search rather than a Markov chain
Validates leaf nodes (abducibles) using semantic similarity to ground-truth embeddings or 'oracle' LLM agents, which effectively act as integrity constraints

Architecture

The system architecture and execution flow, illustrating the interaction between the Python-based logic controller and the LLM API.

Evaluation Highlights

The system successfully generates full justification traces for complex tasks like causal explanations and scientific literature exploration
Demonstrates compilation of natural language dialog threads into executable Propositional Horn Clause programs
Qualitative validation shows the approach produces 'hallucination-free', crisp answers closer to ground truth than standard chat interactions

Breakthrough Assessment

7/10

Novel integration of symbolic logic control flow (SLD-resolution) with neural generation. While experimental results are qualitative, the architectural mapping of logic programming concepts to LLM prompting is highly innovative.

⚙️ Technical Details

Problem Definition

Setting: Automated navigation of a solution space for a user-specified goal via recursive decomposition and verification

Inputs: Succinct task-specific initiator (prompt/goal)

Outputs: A trace of justification steps (reasoning path) and a synthesized Propositional Horn Clause program

Pipeline Flow

Interactor (API management) → Recursor (Logic Steering) → Unfolder (Step Execution) → Refiner (Validation)

System Modules

Interactor

Manages LLM API calls, context memory, persistence (caching), and cost tracking

Model or implementation: GPT-4 or GPT-3.5-turbo
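The caching and cost-tracking role described for the Interactor can be illustrated with a minimal sketch. This is not the paper's implementation: the class layout, the `ask_llm` callable (a stand-in for a real API client), and the flat `cost_per_call` figure are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict


class Interactor:
    """Hypothetical wrapper: caches replies and counts billable calls."""

    def __init__(self, ask_llm: Callable[[str], str], cost_per_call: float = 0.0):
        self.ask_llm = ask_llm              # stand-in for a real LLM API call
        self.cost_per_call = cost_per_call  # assumed flat price, for illustration only
        self.cache: Dict[str, str] = {}
        self.calls = 0

    def ask(self, prompt: str) -> str:
        # Only call the model when the exact prompt has not been seen before.
        if prompt not in self.cache:
            self.calls += 1
            self.cache[prompt] = self.ask_llm(prompt)
        return self.cache[prompt]

    @property
    def total_cost(self) -> float:
        return self.calls * self.cost_per_call

    def save(self, path: str) -> None:
        # Persist the cache so a later run can reuse earlier answers.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.cache, f)

    def load(self, path: str) -> None:
        with open(path, "r", encoding="utf-8") as f:
            self.cache.update(json.load(f))
```

Because a recursive descent tends to revisit identical prompts, keying the cache on the full prompt text is a simple way to avoid paying for the same call twice.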

Prompter

Generates dynamic prompts for AND-steps (elaboration) and OR-steps (alternatives) based on current context

Model or implementation: Python dictionary-based templates
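A dictionary-based Prompter might look like the sketch below. The template wording, the function names `make_prompt` and `parse_items`, and the parameter `k` are hypothetical and are not taken from the paper; the parsing helper shows one way to cope with the strict output format that the recursion depends on.

```python
import re
from typing import List, Sequence

# Hypothetical templates, one per step kind.
PROMPTS = {
    "or": ("Task: {goal}\nContext so far: {context}\n"
           "List up to {k} alternative ways to approach this task, "
           "one short noun phrase per line."),
    "and": ("Task: {goal}\nContext so far: {context}\n"
            "List up to {k} things that must all hold for this, "
            "one short noun phrase per line."),
}


def make_prompt(kind: str, goal: str, trail: Sequence[str], k: int = 3) -> str:
    """Fill the template for an OR-step or AND-step with the current context."""
    context = " -> ".join(trail) if trail else "(none)"
    return PROMPTS[kind].format(goal=goal, context=context, k=k)


def parse_items(reply: str) -> List[str]:
    """Turn a line-per-item reply into a list, dropping bullets and numbering."""
    items = []
    for line in reply.splitlines():
        item = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if item:
            items.append(item)
    return items
```

For example, `parse_items("1. foo\n- bar")` yields `["foo", "bar"]`, so replies with or without list markers are handled the same way.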

AndOrExplorer

Implements recursive descent; yields clause heads (OR-step) and clause bodies (AND-step) emulating a logic interpreter

Model or implementation: Python generator-based coroutine
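The generator-based recursive descent can be sketched as follows. This is a simplified illustration, not the paper's code: the `expand` callable stands in for an LLM query, the clause layout (one clause per alternative, one clause per alternative's sub-goals) is an assumed encoding, and the canned data is a toy placeholder.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Clause = Tuple[str, List[str]]  # (head, body); an empty body marks a leaf fact
Expander = Callable[[str, str, Tuple[str, ...]], List[str]]


def explore(goal: str, expand: Expander, depth: int,
            trail: Tuple[str, ...] = ()) -> Iterator[Clause]:
    """Depth-limited And-Or descent that yields propositional clauses."""
    if depth == 0 or goal in trail:
        yield (goal, [])          # leaf: assumed true, to be validated later
        return
    trail = trail + (goal,)       # controller-side record of the path so far
    for alt in expand("or", goal, trail):        # OR-step: alternatives
        yield (goal, [alt])
        parts = expand("and", alt, trail)        # AND-step: required sub-goals
        yield (alt, parts)
        for part in parts:
            yield from explore(part, expand, depth - 1, trail + (alt,))


# Toy canned answers standing in for LLM replies (hypothetical data).
CANNED: Dict[Tuple[str, str], List[str]] = {
    ("or", "goal"): ["option A", "option B"],
    ("and", "option A"): ["step A1", "step A2"],
    ("and", "option B"): ["step B1"],
}


def canned_expand(kind: str, goal: str, trail: Tuple[str, ...]) -> List[str]:
    return CANNED.get((kind, goal), [])


clauses = list(explore("goal", canned_expand, depth=1))
```

Using a generator means clauses are produced lazily, so a caller can stop the search early or filter each clause as it arrives.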

Refiner

Validates leaf nodes (abducibles) using semantic distance to ground truth or LLM-based advice

Model or implementation: Vector Similarity Search / Oracle LLM
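A similarity-based Refiner can be illustrated with the sketch below. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and the `threshold` value and function names are assumptions; only the overall idea (keep a hypothesis when it is close enough to some ground-truth fact) comes from the summary above.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a vector model."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def refine(hypotheses: List[str], ground_truth: List[str],
           threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep hypotheses whose best similarity to any fact reaches the threshold."""
    facts = [embed(s) for s in ground_truth]
    kept = []
    for h in hypotheses:
        best = max((cosine(embed(h), f) for f in facts), default=0.0)
        if best >= threshold:
            kept.append((h, best))
    return kept
```

The threshold trades recall for precision: a higher value discards more unsupported leaves but may also reject correct hypotheses that are phrased differently from the stored facts.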

Novel Architectural Elements

Implementation of SLD-resolution using Python generators where the 'knowledge base' is dynamically generated by an LLM on-the-fly
Substitution of logical unification with LLM-based semantic generation of clause heads and bodies
Dual-memory architecture: Short-term context memory (passed to LLM) vs. Long-term goal stack (managed by the Python controller)

Modeling

Base Model: GPT-4 and GPT-3.5-turbo

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoT: Recursive depth-first search with backtracking vs. linear generation
vs. ToT: Explicit mapping to Horn Clause logic and abductive reasoning vs. general heuristic search
vs. DSP: Compiles conversation to executable logic program vs. purely textual output
vs. ReAct: Uses an external control stack (logic engine) to manage state vs. relying mostly on the LLM's own context window
vs. LangChain [not cited in paper]: Implements a specific recursive logic algorithm (SLD) rather than a general DAG of chains

Limitations

Depends heavily on the underlying LLM's ability to follow strict syntactic output constraints for parsing
Context window limits of the LLM constrain the depth of the short-term memory passed during recursion
Latency can be high due to sequential API calls required by the recursive descent process
No quantitative benchmarking against standard QA datasets provided in the paper

Reproducibility

Code: https://github.com/ptarau/recursors

Code is publicly available at https://github.com/ptarau/recursors. The paper describes the architecture and prompt patterns in detail. No specific training data or fine-tuning was used (inference-only).

📊 Experiments & Results

Evaluation Setup

Qualitative demonstration of the algorithm on specific tasks: causal reasoning, recommendation, and scientific literature exploration.

Metrics:

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The system can autonomously generate deep, step-by-step reasoning traces starting from a single succinct prompt
The generated dialog threads can be compiled into a Horn Clause program whose minimal model collects the results of the reasoning process
Semantic similarity checks against ground truth effectively act as 'integrity constraints' to filter out hallucinations during the generation process
The approach is versatile, capable of handling diverse tasks (causal prediction, literature search, recommendation) by swapping the Prompter and Refiner modules
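The minimal model of a propositional Horn Clause program, mentioned in the takeaways above, can be computed by forward chaining. The sketch below is illustrative only: the `(head, body)` pair encoding and the single-letter atoms are assumptions, not the paper's representation.

```python
from typing import Iterable, List, Set, Tuple

Clause = Tuple[str, List[str]]  # head is true if every atom in body is true


def minimal_model(clauses: Iterable[Clause]) -> Set[str]:
    """Forward chaining: add heads whose bodies are satisfied until nothing changes."""
    clauses = list(clauses)
    model: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model


# Hypothetical program: g holds via a (which needs b and c); the branch via d fails
# because e is never established.
program = [
    ("g", ["a"]),
    ("a", ["b", "c"]),
    ("b", []),
    ("c", []),
    ("g", ["d"]),
    ("d", ["e"]),
]
model = minimal_model(program)
```

Here the model contains `g`, `a`, `b`, and `c` but not `d` or `e`, which shows how a leaf rejected during validation (and therefore never emitted as a fact) removes the whole branch that depends on it.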

📚 Prerequisite Knowledge

Prerequisites

Logic Programming (Horn Clauses, SLD-resolution)
Abductive Reasoning
LLM Prompt Engineering
Vector Embeddings and Semantic Search

Key Terms

SLD-resolution: Selective Linear Definite clause resolution, a standard algorithm used in logic programming (as in Prolog) to prove goals by recursively breaking them down

Horn Clause: A logical rule with at most one positive literal (head), used here to represent implication: 'Head is true if Body is true'

And-Or Tree: A hierarchical structure where some nodes require all children to be true (AND) and others require only one child to be true (OR)

Abducibles: Facts or hypotheses at the bottom of the recursion depth that are tentatively assumed to be true unless contradicted by constraints

Recursor: The core algorithm in this paper that steers the LLM to recursively explore alternatives (OR) and details (AND)

Refiner: A specialization of a Recursor that validates hypotheses using semantic search against ground-truth facts or oracle advice

Embeddings Store: A database storing vector representations of text to allow for semantic similarity search

Oracle: A specialized LLM agent or algorithm used to rate, filter, or validate the truthfulness of a generated hypothesis