
L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Ziqi Wang, Boqin Yuan
University of Southern California, University of California, San Diego
arXiv (2025)
Agent, RAG, Factuality, Benchmark, Reasoning

📝 Paper Summary

Agentic RAG pipeline Legal AI
L-MARS reduces hallucination in legal QA with a multi-agent workflow that iteratively decomposes queries, runs targeted searches across web and legal databases, and uses a judge agent to verify that the retrieved evidence is sufficient before answering.
Core Problem
General-purpose LLMs in legal contexts often produce hallucinations or confidently state unsupported answers because they lack up-to-date knowledge of evolving laws and fail to perform deep verification.
Why it matters:
  • Incorrect legal citations or outdated statutes carry significant real-world risk and undermine credibility in decision-making
  • Standard RAG is brittle: if the single-pass retrieval misses key evidence, the model attempts to answer anyway, leading to incomplete reasoning
  • Fine-tuning is impractical for law because regulations change rapidly, requiring frequent and costly retraining
Concrete Example: When answering a question about a specific 2025 executive order, a standard RAG model might hallucinate a generic policy based on older training data. L-MARS would detect the specific date requirement, retrieve the actual order text via Serper, and verify it matches the question's jurisdiction before answering.
Key Novelty
Iterative Search-Judge-Refine Loop for Law
  • Replaces single-pass retrieval with a multi-agent team: a Query Agent plans, a Search Agent targets heterogeneous sources (Web, CourtListener, local RAG), and a Judge Agent enforces strict sufficiency checks
  • Introduces an 'evidence sufficiency' feedback loop: if the Judge Agent finds the retrieved laws insufficient or contradictory, it triggers specific follow-up searches rather than forcing an answer
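The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_loop`, `Evidence`, `Verdict`, `Result`, the `max_rounds` cap, and the toy agents are all hypothetical, and the real agents would call an LLM and backends such as web search or CourtListener instead of a dictionary lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    source: str  # hypothetical label, e.g. "web", "courtlistener", "local_rag"
    text: str


@dataclass
class Verdict:
    sufficient: bool
    follow_ups: List[str] = field(default_factory=list)  # queries for the next round


@dataclass
class Result:
    answer: str
    rounds: int
    sufficient: bool


def run_loop(
    question: str,
    plan: Callable[[str], List[str]],                # Query Agent (assumed interface)
    search: Callable[[str], List[Evidence]],         # Search Agent (assumed interface)
    judge: Callable[[str, List[Evidence]], Verdict], # Judge Agent (assumed interface)
    write: Callable[[str, List[Evidence]], str],     # final answer writer (assumed)
    max_rounds: int = 3,                             # assumed cap, not from the paper
) -> Result:
    queries = plan(question)
    evidence: List[Evidence] = []
    rounds = 0
    while rounds < max_rounds and queries:
        rounds += 1
        for query in queries:
            evidence.extend(search(query))
        verdict = judge(question, evidence)
        if verdict.sufficient:
            return Result(write(question, evidence), rounds, True)
        # Insufficient evidence: search again with the judge's follow-ups
        # instead of forcing an answer.
        queries = verdict.follow_ups
    return Result("Insufficient evidence to answer.", rounds, False)


# Toy agents so the sketch runs end to end (all data below is made up).
TOY_INDEX = {
    "executive order": [Evidence("web", "General overview of executive orders.")],
    "executive order 2025 full text": [Evidence("web", "Text of the 2025 order.")],
}


def toy_plan(question: str) -> List[str]:
    return ["executive order"]


def toy_search(query: str) -> List[Evidence]:
    return TOY_INDEX.get(query, [])


def toy_judge(question: str, evidence: List[Evidence]) -> Verdict:
    if any("2025" in item.text for item in evidence):
        return Verdict(True)
    return Verdict(False, ["executive order 2025 full text"])


def toy_write(question: str, evidence: List[Evidence]) -> str:
    return f"Answer grounded in {len(evidence)} passages."


result = run_loop(
    "What does the 2025 executive order require?",
    toy_plan, toy_search, toy_judge, toy_write,
)
print(result)
```

In the toy run, the first search returns only a generic overview, so the judge requests a more specific query and the loop answers on the second round. If the judge never accepts the evidence, the sketch returns an explicit "insufficient" result rather than a guessed answer.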
Evaluation Highlights
  • Achieves up to 98% accuracy on LegalSearchQA (a new 2025-focused benchmark), compared to 86–89% for baseline LLMs (GPT-4o, Claude-3.5-Sonnet)
  • Lowers the U-Score (an uncertainty measure) from 0.55–0.62 for the baselines to 0.39–0.42 for L-MARS, indicating less hedging and vagueness
  • Human and LLM judges consistently prefer L-MARS answers for factual grounding and reasoning quality, despite higher latency (55.7 s for L-MARS vs. 1–4 s for the baselines)
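To put the accuracy and latency figures side by side, the short calculation below derives the gain in percentage points and the slowdown factor from the numbers quoted above. It is only rough arithmetic: 98% is reported as an "up to" figure, so the gains are upper bounds, and the variable names are illustrative.

```python
# Figures quoted in the highlights above.
lmars_accuracy = 98.0              # percent, best reported L-MARS result
baseline_accuracy = (86.0, 89.0)   # percent, range across baseline LLMs
lmars_latency = 55.7               # seconds
baseline_latency = (1.0, 4.0)      # seconds, range across baselines

# Accuracy gain in percentage points (smallest, largest).
gain_points = (
    lmars_accuracy - max(baseline_accuracy),
    lmars_accuracy - min(baseline_accuracy),
)

# Latency multiplier (smallest, largest).
slowdown = (
    lmars_latency / max(baseline_latency),
    lmars_latency / min(baseline_latency),
)

print(f"Accuracy gain: {gain_points[0]:.0f} to {gain_points[1]:.0f} points")
print(f"Slowdown: {slowdown[0]:.1f}x to {slowdown[1]:.1f}x")
```

By these figures, a gain of at most 9 to 12 accuracy points costs roughly a 14x to 56x increase in response time.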
Breakthrough Assessment
8/10
Strong practical application of agentic workflows to a high-stakes domain. While the core agentic patterns (looping, reflection) are known, the integration with specific legal backends and the rigorous sufficiency checking make it a robust blueprint for vertical AI.