L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

📝 Paper Summary

Agentic RAG pipeline Legal AI

L-MARS reduces hallucination in legal QA with a multi-agent workflow that iteratively decomposes queries, runs targeted searches across the web and legal databases, and uses a judge agent to verify that the retrieved evidence is sufficient before answering.

Core Problem

General-purpose LLMs in legal contexts often hallucinate or confidently state unsupported answers because they lack up-to-date knowledge of evolving laws and do not verify their claims against authoritative sources.

Why it matters:

Incorrect legal citations or outdated statutes carry significant real-world risk and undermine credibility in decision-making
Standard RAG is brittle: if the single-pass retrieval misses key evidence, the model attempts to answer anyway, leading to incomplete reasoning
Fine-tuning is impractical for law because regulations change rapidly, requiring frequent and costly retraining

Concrete Example: When answering a question about a specific 2025 executive order, a standard RAG model might hallucinate a generic policy based on older training data. L-MARS would detect the specific date requirement, retrieve the actual order text via Serper, and verify it matches the question's jurisdiction before answering.

Key Novelty

Iterative Search-Judge-Refine Loop for Law

Replaces single-pass retrieval with a multi-agent team: a Query Agent plans, a Search Agent targets heterogeneous sources (Web, CourtListener, local RAG), and a Judge Agent enforces strict sufficiency checks
Introduces an 'evidence sufficiency' feedback loop: if the Judge Agent finds the retrieved laws insufficient or contradictory, it triggers specific follow-up searches rather than forcing an answer

Evaluation Highlights

Achieves up to 98% accuracy on LegalSearchQA (a new 2025-focused benchmark), compared to 86–89% for baseline LLMs (GPT-4o, Claude-3.5-Sonnet)
Reduces uncertainty: U-Score drops from 0.55–0.62 (baselines) to 0.39–0.42 (L-MARS), where a lower score indicates less hedging and vagueness
Human and LLM judges consistently prefer L-MARS answers for factual grounding and reasoning quality, despite higher latency (55.7 s for L-MARS vs. 1–4 s for the baselines)

Breakthrough Assessment

8/10

Strong practical application of agentic workflows to a high-stakes domain. While the core agentic patterns (looping, reflection) are known, the integration with specific legal backends and the rigorous sufficiency checking make it a robust blueprint for vertical AI.

⚙️ Technical Details

Problem Definition

Setting: Iterative generation of a reasoning chain R and final answer a, given query q, evolving retrieval results D, and optional clarifications U.

Inputs: User query q (legal question)

Outputs: Final answer a and reasoning chain R

Pipeline Flow

Query Agent (parses user question)
Search Agent (retrieves evidence from Web/Local/CourtListener)
Judge Agent (evaluates evidence sufficiency)
Conditional Loop (if insufficient -> refine search; if sufficient -> proceed)
Summary Agent (synthesizes final answer)
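The flow above can be sketched as a plain Python loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_pipeline`, `Verdict`, and `max_rounds` are hypothetical, the four agents are passed in as ordinary callables, and the round cap is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    """Judge output: whether evidence suffices, plus follow-up searches if not."""
    sufficient: bool
    follow_up_queries: List[str] = field(default_factory=list)


def run_pipeline(
    question: str,
    parse_query: Callable[[str], List[str]],         # Query Agent (hypothetical signature)
    search: Callable[[str], List[str]],              # Search Agent
    judge: Callable[[str, List[str]], Verdict],      # Judge Agent
    summarize: Callable[[str, List[str]], str],      # Summary Agent
    max_rounds: int = 3,                             # assumed cap, not from the paper
) -> str:
    queries = parse_query(question)
    evidence: List[str] = []
    for _ in range(max_rounds):
        # Run every pending search and accumulate the evidence.
        for q in queries:
            evidence.extend(search(q))
        verdict = judge(question, evidence)
        if verdict.sufficient:
            break
        # Insufficient: refine with the judge's follow-up queries.
        queries = verdict.follow_up_queries
    # Once sufficient (or the assumed cap is hit), synthesize the answer.
    return summarize(question, evidence)
```

In this sketch the Summary Agent still runs if the cap is reached without a sufficient verdict; a production system might instead return an explicit "insufficient evidence" response.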

System Modules

Query Agent

Parses user question into structured query result (issue type, jurisdiction, time window, search intents) and proposes clarifying sub-questions

Model or implementation: GPT-4o/Claude-3.5/Gemini (configurable)
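One way to represent the structured query result is a small dataclass populated from the model's JSON output. The field names mirror the four items listed above plus the clarifying sub-questions, but the exact schema, the `QueryResult` name, and the JSON keys are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryResult:
    """Hypothetical schema for the Query Agent's structured output."""
    issue_type: str
    jurisdiction: Optional[str] = None
    time_window: Optional[str] = None
    search_intents: List[str] = field(default_factory=list)
    clarifying_questions: List[str] = field(default_factory=list)


def parse_query_result(raw_json: str) -> QueryResult:
    """Build a QueryResult from a JSON string (assumed to come from the LLM)."""
    data = json.loads(raw_json)
    return QueryResult(
        issue_type=data["issue_type"],            # required in this sketch
        jurisdiction=data.get("jurisdiction"),    # optional: may be unknown
        time_window=data.get("time_window"),
        search_intents=list(data.get("search_intents", [])),
        clarifying_questions=list(data.get("clarifying_questions", [])),
    )
```

Keeping jurisdiction and time window as explicit fields lets later stages check them directly instead of re-reading the free-text question.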

Search Agent

Executes tool calls to retrieval backends (Serper, BM25 local, CourtListener) and normalizes results

Model or implementation: Not a distinct model; tool-using component
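Normalization across backends might look like the sketch below. It does not call Serper, CourtListener, or a BM25 index directly: each backend is a hypothetical callable that returns `(title, url, snippet)` tuples, and the `Evidence` record and URL-based deduplication are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

RawHit = Tuple[str, str, str]  # (title, url, snippet); assumed backend output shape


@dataclass(frozen=True)
class Evidence:
    """Common record that every backend's hits are normalized into."""
    source: str
    title: str
    url: str
    snippet: str


def gather_evidence(
    query: str,
    backends: Dict[str, Callable[[str], Iterable[RawHit]]],
) -> List[Evidence]:
    seen_urls = set()
    results: List[Evidence] = []
    for name, backend in backends.items():
        for title, url, snippet in backend(query):
            if url in seen_urls:
                continue  # keep only the first hit for a given URL
            seen_urls.add(url)
            results.append(Evidence(name, title.strip(), url, snippet.strip()))
    return results
```

Tagging each record with its source lets a downstream judge weigh a court opinion differently from a general web page.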

Judge Agent

Evaluates sufficiency of retrieved evidence against a checklist (jurisdiction, date, contradictions)

Model or implementation: GPT-o3 (default temperature T=0)
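The checklist can be pictured as a set of pass/fail fields whose failures drive the next round of searches. In L-MARS the judgment itself comes from an LLM; the sketch below only shows one possible data shape for its output, and the class name, field names, and query hints are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SufficiencyChecklist:
    """Hypothetical record of the Judge Agent's three checks."""
    jurisdiction_matches: bool
    date_in_window: bool
    free_of_contradictions: bool

    def failed_checks(self) -> List[str]:
        # Names of the checks that did not pass, in field order.
        return [name for name, passed in vars(self).items() if not passed]

    @property
    def sufficient(self) -> bool:
        return not self.failed_checks()


def follow_up_queries(question: str, checklist: SufficiencyChecklist) -> List[str]:
    """Turn each failed check into a targeted follow-up search (illustrative hints)."""
    hints = {
        "jurisdiction_matches": "restrict to the question's jurisdiction",
        "date_in_window": "restrict to the relevant dates",
        "free_of_contradictions": "find the controlling or most recent authority",
    }
    return [f"{question} ({hints[name]})" for name in checklist.failed_checks()]
```

Making each check explicit means a failed verdict always says which gap the next search should close.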

Summary Agent

Composes the final response with citations and rationale once evidence is sufficient

Model or implementation: GPT-4o/Claude-3.5/Gemini (configurable)

Novel Architectural Elements

Integration of heterogeneous legal sources (Web + Local RAG + CourtListener API) within a unified agentic loop
Explicit 'Judge Agent' node that acts as a gatekeeper, forcing iterative refinement based on a legal-specific sufficiency checklist (jurisdiction, temporal validity)

Modeling

Base Model: Evaluated with GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Flash acting as the agent backbone

Reproducibility

Code: https://github.com/boqiny/L-MARS

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Familiarity with agentic workflows (LangGraph, tool use)
Basic knowledge of legal citation and jurisdiction

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which the system first retrieves relevant documents and then conditions its answer on them

LangGraph: A library for building stateful, multi-agent applications with LLMs using graph structures

Serper API: A tool that enables LLMs to perform Google searches and retrieve web results

CourtListener: A legal research platform providing APIs for accessing US court opinions and filings

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a search query
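For readers new to BM25, a compact version of the standard Okapi scoring formula is sketched below. The parameters `k1=1.5` and `b=0.75` are common defaults, not values reported for L-MARS, and the inputs are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,   # term-frequency saturation (common default, assumed)
    b: float = 0.75,   # document-length normalization (common default, assumed)
) -> List[float]:
    """Score each tokenized document against the tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Rarer query terms receive a higher IDF weight, so a document matching a distinctive term outranks one that matches only a common term.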

U-Score: A custom uncertainty metric proposed in this paper combining hedging, temporal vagueness, citation sufficiency, jurisdictional specificity, and decisiveness
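The summary does not give the U-Score formula, so the sketch below is only one plausible way to combine five components into a single number. The weighted mean, the equal default weights, the component names, and the convention that every component lies in [0, 1] with higher meaning more uncertain are all assumptions.

```python
from typing import Dict, Optional

# Component names are illustrative; each is oriented so that higher = more uncertain.
COMPONENTS = (
    "hedging",
    "temporal_vagueness",
    "citation_insufficiency",
    "jurisdictional_vagueness",
    "indecisiveness",
)


def u_score(
    components: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of five uncertainty components (assumed form, not the paper's)."""
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}  # equal weights: an assumption
    total_weight = sum(weights[name] for name in COMPONENTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name in COMPONENTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(weights[n] * components[n] for n in COMPONENTS) / total_weight
```

Under this convention a fully decisive, well-cited, jurisdiction-specific answer scores near 0 and a vague, hedged one scores near 1, consistent with lower U-Scores being better.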

DAG: Directed Acyclic Graph—a structure used here to define the flow of control between agents

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer