Xuewen Han, Neng Wang, Shangkun Che, Hongyang Yang, Kunpeng Zhang, Sean Xin Xu
Tsinghua University,
AIFinance Foundation,
University of Maryland, College Park
International Conference on AI in Finance
(2024)
Agent, RAG, Benchmark
📝 Paper Summary
Multi-agent, Agentic RAG pipeline
This paper proposes a multi-agent system with configurable collaboration structures (vertical, horizontal, hybrid) for financial research, demonstrating that complex tasks like risk analysis benefit from agent ensembles while simple tasks favor single agents.
Core Problem
Existing financial AI tools typically rely on single-agent systems that fail to leverage collaborative intelligence, while standard multi-agent debate methods are impractical for complex, structured corporate workflows.
Why it matters:
Financial decision-making requires integrating diverse perspectives (risk, sentiment, fundamentals), which single models struggle to balance
Applying unstructured multi-agent debates (MAD) to large groups is inefficient and lacks the clear role definitions needed for rigorous investment research
There is a lack of empirical validation regarding which agent topology (hierarchy vs. flat team) works best for specific financial sub-tasks
Concrete Example: In a risk analysis task, a single agent might overlook a subtle liability in a 10-K form because it is overwhelmed by the context. A 'Vertical' multi-agent group assigns a leader to direct a subordinate specifically to 'analyze liquidity risks,' ensuring deeper coverage.
Key Novelty
Configurable Multi-Agent Collaboration Topologies for Finance
Defines three distinct collaboration structures (Horizontal, Vertical, Hybrid) that dictate how agents share information and authority, moving beyond simple 'more agents is better' logic
Implements a 'Vertical' structure via a nested chat mechanism where leaders issue hidden commands to subordinates, simulating corporate hierarchy within LLM interactions
Treats RAG as a unified tool function callable by agents, allowing them to autonomously refine query parameters rather than relying on fixed retrieval settings
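The idea of exposing retrieval as a tool whose parameters the agent can adjust can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `rag_tool`, the `top_k` parameter, and the bag-of-words cosine similarity, which stands in for a real embedding model so the example runs with the standard library only. It is not the paper's implementation.

```python
import math
from collections import Counter


def _vec(text):
    """Toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rag_tool(query, chunks, top_k=2):
    """Hypothetical tool: return the top_k chunks most similar to the query.

    An agent could call this repeatedly, rewording `query` or changing
    `top_k`, instead of being locked into fixed retrieval settings.
    """
    q = _vec(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vec(c)), reverse=True)
    return ranked[:top_k]


# Made-up text chunks, not taken from any real filing.
chunks = [
    "liquidity risk increased as short-term debt matures",
    "revenue grew in the cloud segment",
    "the company faces litigation risk from patent disputes",
]

narrow = rag_tool("liquidity risk", chunks, top_k=1)
broad = rag_tool("liquidity risk", chunks, top_k=2)
```

Here the same tool serves both a narrow and a broader lookup; the only difference is the arguments the caller chooses.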
Architecture
Overview of the agent structures (Single, Dual, Vertical, Horizontal, Hybrid) and the unified RAG/Tool calling mechanism.
Evaluation Highlights
Ensemble multi-agent structure achieves 66.7% accuracy in 'buy or not' investment decisions on Dow Jones stocks
Achieves a low 2.35% average difference in one-week target price predictions using the optimal agent configuration
Demonstrates that single agents actually outperform multi-agent groups on simpler tasks like fundamental and sentiment analysis
Breakthrough Assessment
7/10
Provides a practical, empirically grounded framework for structuring multi-agent teams in finance. While the underlying models are standard (GPT-4), the structural analysis of agent collaboration topologies is valuable.
⚙️ Technical Details
Problem Definition
Setting: AI-powered investment research analyzing SEC 10-K forms to predict stock movements and make investment recommendations
Inputs: Company ticker symbol and 2023 SEC 10-K form (PDF converted to text)
Outputs: Investment decision (Buy/Not Buy) and Target Price (1-week forecast)
Pipeline Flow
User Input (Company Ticker)
Leader Agent (Planning & Coordination)
Sub-Agents (Execution via Tools)
Tools (RAG, YFinance, Reddit API)
Final Report Generation
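The flow above can be sketched as a chain of plain functions. This is a simplified illustration, not the paper's code: the tool stubs, the task-to-tool mapping, and all function names are hypothetical, and the leader's plan is hard-coded where the real system would use an LLM.

```python
from typing import Callable, Dict, List


# Stub tools standing in for RAG, YFinance, and the Reddit API (all hypothetical).
def filing_tool(ticker: str) -> str:
    return f"10-K excerpts for {ticker}"


def price_tool(ticker: str) -> str:
    return f"price history for {ticker}"


def social_tool(ticker: str) -> str:
    return f"forum posts about {ticker}"


# Assumed mapping of analysis task to tool; the paper does not fix this mapping here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "fundamentals": price_tool,
    "sentiment": social_tool,
    "risk": filing_tool,
}


def leader_plan(ticker: str) -> List[str]:
    """Leader step: choose which analyses to delegate (fixed list in this sketch)."""
    return ["fundamentals", "sentiment", "risk"]


def analyst_run(task: str, ticker: str) -> str:
    """Sub-agent step: call the tool assigned to the task and summarize the result."""
    evidence = TOOLS[task](ticker)
    return f"[{task}] based on {evidence}"


def run_pipeline(ticker: str) -> str:
    """User input -> leader plan -> sub-agents -> tools -> final report."""
    sections = [analyst_run(task, ticker) for task in leader_plan(ticker)]
    return f"Report for {ticker}\n" + "\n".join(sections)


report = run_pipeline("AAPL")
```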
System Modules
Leader Agent
Global planning, task delegation, and final synthesis of reports
Model or implementation: GPT-4-1106-vision-preview
Analyst Agents
Execute specific analyses (Fundamentals, Sentiment, Risk) using tools
Model or implementation: GPT-4-1106-vision-preview
RAG Tool
Retrieve context from 10-K filings
Model or implementation: all-MiniLM-L6-v2 (Embedding model)
Novel Architectural Elements
Nested chat mechanism for Vertical Collaboration: Leader output triggers a separate, isolated chat loop with a specific subordinate that is invisible to others
Hybrid Collaboration structure: Maintains leader authority for final decisions but allows shared communication among subordinates
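One way to picture the difference between the structures is as a rule for who can read each message and who may issue the final decision. The sketch below is an interpretation under stated assumptions: the names `audience` and `can_finalize` are hypothetical, and treating a horizontal team as one where any member may finalize is an assumption, not something the summary states.

```python
LEADER = "leader"


def audience(topology, sender, target, team):
    """Return the set of agents that can read a message from sender to target."""
    if topology == "vertical":
        # Nested chat: only the two participants see the exchange.
        return {sender, target}
    if topology in ("horizontal", "hybrid"):
        # Shared chat: every team member sees every message.
        return set(team)
    raise ValueError(f"unknown topology: {topology}")


def can_finalize(topology, agent):
    """Who may issue the final decision (the horizontal case is an assumption)."""
    if topology == "horizontal":
        return True
    return agent == LEADER


team = [LEADER, "fundamentals", "sentiment", "risk"]

hidden = audience("vertical", LEADER, "risk", team)
shared = audience("hybrid", "risk", "sentiment", team)
```

Under this reading, hybrid differs from vertical only in message visibility and from horizontal only in decision authority.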
Modeling
Base Model: GPT-4-1106-vision-preview (OpenAI API)
Training Method: Inference-only prompting and orchestration
Compute: Not reported in the paper (relies on external API)
Comparison to Prior Work
vs. StockAgent/FinAgent: Uses structured multi-agent collaboration (Vertical/Hybrid) rather than single-agent workflows
vs. MAD: Focuses on structured role-based collaboration (leader-subordinate) rather than unstructured debate/consensus mechanisms
Limitations
Relies on closed-source GPT-4 API, making costs high and reproducibility dependent on OpenAI
Analysis limited to 30 Dow Jones companies and 2023 10-K forms
Performance on simple tasks (Fundamentals/Sentiment) degrades with multi-agent complexity compared to single agents
Code is publicly available at https://github.com/AI4Finance-Foundation/FinRobot. The paper relies on closed-source models (GPT-4) and third-party APIs (FMP, FinnHub, Reddit), meaning exact replication depends on API access and version stability.
📊 Experiments & Results
Evaluation Setup
Financial analysis of 30 Dow Jones companies using 2023 annual reports
Benchmarks:
Dow Jones 30 Analysis (investment research, real-world data) [New]
Metrics:
Target Price Prediction (Average Difference %)
Buy/Not Buy Accuracy
Statistical methodology: Not explicitly reported in the paper
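Since the exact formulas are not given here, the following sketch shows one plausible reading of the two metrics: average difference as the mean absolute percentage gap between predicted and realized prices, and accuracy as the share of correct Buy/Not Buy calls. The function names and the sample numbers are made up for illustration and are not the paper's data.

```python
def avg_pct_diff(predicted, actual):
    """Mean absolute percentage difference between predicted and actual prices."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / a * 100 for p, a in zip(predicted, actual)) / len(predicted)


def accuracy(decisions, outcomes):
    """Fraction of Buy/Not Buy decisions that match the realized outcome."""
    if len(decisions) != len(outcomes) or not decisions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(d == o for d, o in zip(decisions, outcomes)) / len(decisions)


# Illustrative values only.
price_gap = avg_pct_diff([102.0, 98.0], [100.0, 100.0])
hit_rate = accuracy(["buy", "not buy", "buy"], ["buy", "buy", "buy"])
```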
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Dow Jones 30 Analysis | Target Price Prediction Avg Diff | Not reported in the paper | 2.35% | Not reported in the paper
Dow Jones 30 Analysis | Buy/Not Buy Accuracy | Not reported in the paper | 66.7% | Not reported in the paper
Main Takeaways
Task complexity dictates optimal agent structure: Simple tasks (Fundamentals, Sentiment) are best handled by Single Agents, while complex tasks (Risk Analysis) require Multi-Agent groups.
The 'Ensemble' structure, which combines different agent groups, outperforms individual configurations, achieving 66.7% accuracy in decision making.
Vertical collaboration (strict hierarchy) optimizes efficiency for execution-heavy tasks, whereas Horizontal collaboration (shared chat) promotes better information exchange for simple cooperative tasks.
Increasing agent count does not strictly correlate with performance; for simpler tasks, adding agents introduces noise and reduces efficiency.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Large Language Models (LLMs) and function calling
Familiarity with RAG (Retrieval-Augmented Generation)