Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

📝 Paper Summary

Agentic RAG pipeline
Tool-use with flexible plan

Fanar-Sadiq improves Islamic QA reliability by routing queries to specialized agents—using deterministic calculators for obligations and exact lookup for scripture—rather than forcing all inputs through a single generative pipeline.

Core Problem

Standard RAG pipelines fail to handle the heterogeneity of Islamic queries, often hallucinating scripture or miscalculating strict arithmetic obligations like Zakat and inheritance.

Why it matters:

Fabricating Quranic verses or misattributing Hadith in religious applications carries high stakes and can mislead users on canonical matters
Religious obligations like Zakat and inheritance require strict, rule-based arithmetic that probabilistic LLMs often fail to execute correctly
A 'one-size-fits-all' retrieve-then-generate approach cannot distinguish between requests requiring verbatim lookup, jurisprudential reasoning, or symbolic computation

Concrete Example: When asked to calculate inheritance or Zakat, a standard LLM might produce a plausible-sounding but mathematically invalid distribution that violates Shariah invariants. Similarly, it might paraphrase a Quranic verse (paraphrase drift) when the user requires an exact, verified quotation.
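The kind of invariant mentioned above can be checked mechanically. The sketch below is only an illustration, not the paper's inheritance engine: it does not encode any inheritance rules and merely verifies that a proposed distribution is non-negative and sums exactly to the estate. The function name `check_distribution` and the heir labels are hypothetical.

```python
from fractions import Fraction

def check_distribution(estate: Fraction, payouts: dict[str, Fraction]) -> list[str]:
    """Return the violated invariants; an empty list means the payouts pass."""
    problems = []
    for heir, amount in payouts.items():
        if amount < 0:
            problems.append(f"negative payout for {heir}")
    # Exact rational arithmetic avoids floating-point drift in the sum.
    total = sum(payouts.values(), Fraction(0))
    if total != estate:
        problems.append(f"payouts sum to {total}, expected {estate}")
    return problems

estate = Fraction(100_000)
# A plausible-looking but invalid answer: the amounts exceed the estate.
bad = {"heir_a": Fraction(60_000), "heir_b": Fraction(30_000), "heir_c": Fraction(20_000)}
good = {"heir_a": Fraction(50_000), "heir_b": Fraction(30_000), "heir_c": Fraction(20_000)}

print(check_distribution(estate, bad))   # ['payouts sum to 110000, expected 100000']
print(check_distribution(estate, good))  # []
```

A generated answer that fails such a check can be rejected before it reaches the user, which a free-form generative pipeline cannot guarantee.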

Key Novelty

Intent-Routed Multi-Agent Architecture

Classifies user queries into granular intents (e.g., Fiqh, Zakat, Scripture, Greeting) using a hybrid router (LLM + prototype embeddings)
Routes execution to specialized modules: deterministic engines for math/dates, NL2SQL for statistics, and verified RAG for jurisprudence, ensuring the execution mode matches the query constraints

Architecture

The multi-agent system architecture showing the routing logic and specialized tool execution paths.

Evaluation Highlights

+17.2 percentage-point accuracy improvement on the IslamicFaithQA benchmark (48.2 to 65.4) compared to the base Fanar-2-27B model, demonstrating the value of the agentic architecture
Achieves 85.5% accuracy on PalmX (Islamic Culture), outperforming GPT-5 (82.3%) and Gemini-3-Pro (84.4%)
Surpasses GPT-5 on the FatwaQA generative benchmark (65.1% vs 63.6%) by leveraging specialized retrieval and citation grounding

Breakthrough Assessment

8/10

Strong practical contribution demonstrating that domain-specific routing and deterministic tools can outperform generalist LLMs on high-stakes religious tasks such as PalmX and FatwaQA, although the system trails the baseline on the QIAS inheritance benchmark.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific Question Answering covering retrieval, exact lookup, and rule-based computation

Inputs: Natural language query (Arabic/English)

Outputs: Grounded answer with citations, calculated values, or verbatim scripture

Pipeline Flow

Hybrid Query Classifier (Intent Classification)
Router (Selects execution path)
Execution (Tool Actions / Calculation / RAG / Quran Lookup)
Response Assembler (Integrates outputs and citations)
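The four stages above can be sketched as a small dispatch loop. This is a minimal illustration under stated assumptions, not the paper's code: `classify` uses keyword rules as a stand-in for the hybrid classifier, the three tool functions return placeholder strings, and every name here (`ToolResult`, `ROUTES`, `assemble`, `answer`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolResult:
    text: str
    citations: list[str] = field(default_factory=list)

def classify(query: str) -> str:
    """Stand-in for the hybrid classifier: keyword rules instead of an LLM."""
    q = query.lower()
    if "zakat" in q:
        return "zakat"
    if "verse" in q or "surah" in q:
        return "quran"
    return "fiqh"

# Placeholder tools; real ones would call a calculator, a database, or RAG.
def zakat_tool(query: str) -> ToolResult:
    return ToolResult("<amount from deterministic calculator>")

def quran_tool(query: str) -> ToolResult:
    return ToolResult("<verbatim verse from database>", ["<surah:verse>"])

def fiqh_tool(query: str) -> ToolResult:
    return ToolResult("<grounded answer>", ["<source 1>", "<source 2>"])

ROUTES: dict[str, Callable[[str], ToolResult]] = {
    "zakat": zakat_tool,
    "quran": quran_tool,
    "fiqh": fiqh_tool,
}

def assemble(intent: str, result: ToolResult) -> str:
    """Merge the tool output and its citations into one response string."""
    parts = [result.text]
    if result.citations:
        parts.append("Sources: " + "; ".join(result.citations))
    return f"[{intent}] " + " | ".join(parts)

def answer(query: str) -> str:
    intent = classify(query)               # 1. intent classification
    tool = ROUTES.get(intent, fiqh_tool)   # 2. routing (default path is an assumption)
    result = tool(query)                   # 3. execution
    return assemble(intent, result)        # 4. response assembly

print(answer("How much zakat do I owe on my savings?"))
```

Keeping the routing table separate from the tools means a new intent only needs one new entry in `ROUTES`.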

System Modules

Hybrid Query Classifier

Classify query into 9 intents (e.g., Fiqh, Zakat, Quran) and detect language

Model or implementation: LLM (temperature 0) with Prototype Embedding fallback

Calculation Agents (Execution)

Perform deterministic rule-based computations for obligations

Model or implementation: Deterministic Python Engines (Zakat Calculator, Inheritance Calculator)
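For intuition, a deterministic Zakat computation can be as small as the sketch below. It is not the paper's engine: it assumes the commonly cited 2.5% rate on zakatable wealth that meets the nisab threshold, takes the nisab as a caller-supplied value in the same currency, and ignores asset classes, holding periods, and differences between schools of thought. The function name `zakat_due` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

ZAKAT_RATE = Decimal("0.025")  # assumed rate: 2.5% of qualifying wealth

def zakat_due(net_wealth: Decimal, nisab: Decimal) -> Decimal:
    """Return the Zakat owed, rounded to two decimal places."""
    if net_wealth < 0 or nisab <= 0:
        raise ValueError("net_wealth must be >= 0 and nisab must be > 0")
    if net_wealth < nisab:
        return Decimal("0.00")  # below the threshold, nothing is due
    return (net_wealth * ZAKAT_RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(zakat_due(Decimal("10000"), Decimal("5000")))  # 250.00
print(zakat_due(Decimal("4000"), Decimal("5000")))   # 0.00
```

Using `Decimal` rather than floats keeps the result reproducible to the cent, which is the point of moving this step out of the language model.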

Quran Retrieval Tool (Execution)

Retrieve verbatim verses or compute exact statistics

Model or implementation: NL2SQL (Fine-tuned Qwen-4B) + SQLite Database
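The execution side of an NL2SQL tool can be sketched with Python's built-in `sqlite3` module. The schema (`surahs` with `number`, `name_en`, `verse_count`) is a hypothetical stand-in, since the summary does not describe the actual database, and the SQL string is hard-coded where the fine-tuned model's output would go. The guard that only accepts `SELECT` statements is likewise an assumption about how generated SQL might be constrained.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE surahs (number INTEGER PRIMARY KEY, name_en TEXT, verse_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO surahs VALUES (?, ?, ?)",
        [(1, "Al-Fatihah", 7), (108, "Al-Kawthar", 3), (112, "Al-Ikhlas", 4)],
    )
    conn.commit()
    conn.execute("PRAGMA query_only = ON")  # reject writes from here on
    return conn

def run_generated_sql(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute model-generated SQL, allowing only SELECT statements."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

conn = build_demo_db()
# Stand-in for the NL2SQL output for "How many verses are in the first surah?"
generated = "SELECT verse_count FROM surahs WHERE number = 1"
print(run_generated_sql(conn, generated))  # [(7,)]
```

Because the answer comes from a query result rather than from generated text, the count is exact or the query fails; it cannot drift.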

Fiqh RAG Agent (Execution)

Answer jurisprudential questions with grounded evidence

Model or implementation: Fanar LLM Agent + Dense Retriever
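The retrieval step of such an agent can be illustrated with a small cosine-similarity ranker. This is a sketch under assumptions: the vectors are hand-written toy embeddings rather than encoder outputs, the passage texts are placeholders, and the prompt wording is invented. The only value taken from this summary is the cap of 12 sources (`max_sources`).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], corpus: list[dict], max_sources: int = 12) -> list[dict]:
    """Return the passages most similar to the query, best first."""
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return ranked[:max_sources]

def build_prompt(question: str, docs: list[dict]) -> str:
    """Number the sources so the generated answer can cite them as [1], [2], ..."""
    sources = "\n".join(f"[{i}] {doc['text']}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer using only the numbered sources and cite them.\n"
        f"{sources}\nQuestion: {question}"
    )

corpus = [
    {"id": "a", "vec": [1.0, 0.0], "text": "<passage A>"},
    {"id": "b", "vec": [0.0, 1.0], "text": "<passage B>"},
    {"id": "c", "vec": [1.0, 1.0], "text": "<passage C>"},
]
top = retrieve([1.0, 0.0], corpus, max_sources=2)
print([doc["id"] for doc in top])  # ['a', 'c']
print(build_prompt("<user question>", top))
```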

Novel Architectural Elements

Hybrid routing mechanism combining LLM reasoning with embedding-based prototype fallback for robustness
Integration of deterministic 'Calculator' agents for religious obligations alongside probabilistic RAG agents

Modeling

Base Model: Fanar (Fanar-2-27B in the evaluation; the paper cites Team et al., 2025). This summary does not confirm the underlying open-weight foundation model.

Training Method: Supervised Fine-Tuning (SFT) for NL2SQL module

Adaptation: LoRA SFT

Training Data:

48k template-generated (NL, SQL) pairs for Quranic statistics

Key Hyperparameters:

nl2sql_temperature: 0.1
fiqh_temperature: 0.1
max_sources: 12

Compute: Not reported in the paper

Comparison to Prior Work

vs. AFTINA/FARSIQA: Fanar-Sadiq uses a multi-agent router to invoke deterministic calculators and exact lookup tools, rather than relying solely on a retrieve-generate pipeline
vs. GPT-4/Gemini: Incorporates domain-specific constraints (e.g., inheritance math, verbatim scripture) via tools, reducing hallucination in high-stakes religious queries

Limitations

Dependency on the coverage and quality of the underlying retrieval corpora (500k documents)
Deterministic calculators may not cover all schools of thought (Madhhabs) or complex edge cases
Routing errors can misdirect queries (e.g., sending a calculation question to the Fiqh RAG agent)
Evaluation relies partly on LLM-as-a-judge (GPT-4.1), which may miss nuances in religious reasoning

📊 Experiments & Results

Evaluation Setup

End-to-end system evaluation on public Islamic QA benchmarks (Generative and MCQ)

Benchmarks:

IslamicFaithQA: Generative QA (Faithfulness)
FatwaQA: Generative QA (Jurisprudence)
PalmX: Multiple Choice (Islamic Culture)
QIAS (Task 1): Multiple Choice (Inheritance Reasoning)
IslamTrust: Multiple Choice (Ethical Alignment)

Metrics:

Accuracy (Exact Match for MCQ)
% Correct (LLM-Judge for Generative)
Statistical methodology: Not explicitly reported in the paper
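Exact-match accuracy for the multiple-choice benchmarks reduces to comparing predicted option letters with the gold letters. The sketch below assumes answers are single option labels and normalizes case and surrounding whitespace; the paper's exact normalization is not described in this summary, and the function name is hypothetical.

```python
def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Percentage of predictions that equal the gold label after normalization."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    hits = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)

print(exact_match_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))  # 75.0
```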

Key Results

Generative QA results showing improvements in faithfulness and jurisprudence via the agentic architecture.
Benchmark	Metric	Baseline (model)	This Paper	Δ (points)
IslamicFaithQA	Accuracy % (LLM-Judge)	48.2 (Fanar-2-27B)	65.4	+17.2
FatwaQA	Accuracy % (LLM-Judge)	63.6 (GPT-5)	65.1	+1.5

Multiple Choice Question (MCQ) results comparing general knowledge and specialized reasoning.
Benchmark	Metric	Baseline (model)	This Paper	Δ (points)
PalmX	Accuracy %	82.3 (GPT-5)	85.5	+3.2
QIAS T1	Accuracy %	94.5 (model not stated)	72.2	-22.3
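The Δ column is a difference in percentage points, not a relative change. A short check using the figures from the table makes the distinction concrete; `Decimal` is used so that the subtraction is exact.

```python
from decimal import Decimal

# (baseline, this paper) accuracy values copied from the results table.
rows = {
    "IslamicFaithQA": ("48.2", "65.4"),
    "FatwaQA": ("63.6", "65.1"),
    "PalmX": ("82.3", "85.5"),
    "QIAS T1": ("94.5", "72.2"),
}

# Absolute difference in percentage points.
deltas = {name: Decimal(ours) - Decimal(base) for name, (base, ours) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+} points")

# Relative improvement on IslamicFaithQA, for contrast with the point gain.
base, ours = (Decimal(v) for v in rows["IslamicFaithQA"])
relative = (ours - base) / base * 100
print(f"IslamicFaithQA relative gain: {relative:.1f}%")
```

On IslamicFaithQA the gain is 17.2 points, which corresponds to a relative improvement of roughly 36% over the baseline.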

Main Takeaways

Agentic routing substantially improves faithfulness in open-ended Islamic QA compared to the monolithic base model, as seen in the gain of 17.2 percentage points on IslamicFaithQA (48.2 to 65.4).
Specialized tools allow the system to outperform GPT-5 on cultural (PalmX) and jurisprudential (FatwaQA) benchmarks.
The system still struggles to map internal deterministic calculations to external multiple-choice formats (QIAS), highlighting a gap between symbolic reasoning and standardized testing formats.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Basic knowledge of LLM tool use and agentic workflows
Familiarity with Islamic jurisprudence concepts (Fiqh, Zakat)

Key Terms

RAG: Retrieval-Augmented Generation; AI systems that answer questions by first retrieving relevant documents and then generating a response conditioned on them

Fiqh: Islamic jurisprudence; the human understanding and application of Shariah (divine law)

Zakat: A mandatory form of almsgiving in Islam, calculated based on specific asset thresholds and rates

Hadith: Reports of the sayings, actions, and tacit approvals of the Prophet Muhammad, used as a major source of religious law

NL2SQL: Natural Language to SQL—converting human questions into database queries to retrieve exact statistics

Madhhab: A school of thought within Islamic jurisprudence (e.g., Hanafi, Shafi'i)

Nisab: The minimum amount of wealth a Muslim must possess before becoming liable for Zakat

Uthmanic: The standard written style/script of the Quran

LLM-judge: Using a strong LLM (GPT-4.1 in this paper) to evaluate the correctness of another model's responses