PaSa: An LLM Agent for Comprehensive Academic Paper Search

📝 Paper Summary

Autonomous Research Agents, Information Retrieval

PaSa is a reinforcement-learning-optimized agent that autonomously searches, reads, and navigates citation networks to perform comprehensive literature surveys, significantly outperforming keyword-based search engines.

Core Problem

Standard academic search tools (e.g., Google Scholar) excel at keyword matching but fail at complex, survey-level queries that require navigating citation networks and filtering for specific methodology.

Why it matters:

Researchers spend substantial time manually conducting literature surveys to ensure comprehensive coverage
Keyword-based search engines often miss relevant papers that don't share exact terminology but are conceptually related (long-tail knowledge)
Existing LLM search tools typically just rephrase queries rather than engaging in deep, multi-step research behaviors like reading papers and checking references

Concrete Example: For the query 'Which studies have focused on non-stationary reinforcement learning using value-based methods, specifically UCB-based algorithms?', a standard search might miss papers that don't explicitly mention 'non-stationary' in the title but discuss it in the body, whereas PaSa follows citations from seed papers to find them.

Key Novelty

Dual-Agent Architecture with Session-Level RL

Decomposes search into a 'Crawler' agent (maximizes recall via search and citation expansion) and a 'Selector' agent (maximizes precision via content reading)
Optimizes the Crawler using a novel session-level PPO (Proximal Policy Optimization) that breaks long search trajectories into manageable segments
Uses the Selector as an auxiliary reward model during training to solve the 'sparse reward' problem inherent in citation-based ground truth
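
The auxiliary-reward idea can be illustrated with a minimal Python sketch. It assumes a hypothetical reward shape in which a newly found paper earns credit if it appears in the citation-derived ground truth or if the Selector accepts it. The names `crawler_reward`, `selector_accepts`, and `toy_selector`, as well as the value of `alpha`, are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, Iterable, Set


def crawler_reward(
    query: str,
    new_papers: Iterable[str],
    ground_truth: Set[str],
    selector_accepts: Callable[[str, str], bool],
    alpha: float = 1.0,
) -> float:
    """Reward for one Crawler action (hypothetical shape).

    A newly queued paper earns credit if it is in the citation-derived
    ground truth OR the Selector judges it relevant. The second condition
    densifies the otherwise sparse ground-truth signal.
    """
    hits = 0
    for paper in new_papers:
        if paper in ground_truth or selector_accepts(query, paper):
            hits += 1
    return alpha * hits


def toy_selector(query: str, paper: str) -> bool:
    """Stand-in for the Selector: accepts titles that mention 'UCB'."""
    return "UCB" in paper


reward = crawler_reward(
    query="non-stationary RL with UCB-based algorithms",
    new_papers=["Sliding-Window UCB", "Image Captioning Survey", "Paper A"],
    ground_truth={"Paper A"},
    selector_accepts=toy_selector,
    alpha=0.5,
)
print(reward)
```

In this toy run, one paper is credited only because the stand-in Selector accepts it, which is the case a purely citation-based reward would miss.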

Architecture

The overall workflow of PaSa, illustrating the interaction between the User, Crawler, Paper Queue, and Selector.

Evaluation Highlights

+37.78% improvement in Recall@20 on RealScholarQuery (real-world dataset) compared to Google search enhanced with GPT-4o
PaSa-7B outperforms PaSa-GPT-4o (the same agent architecture powered by GPT-4o prompting) by 30.36% in recall on real-world queries
The synthetic training dataset AutoScholarQuery reaches a ~94% qualification rate under human review

Breakthrough Assessment

8/10

Strong practical contribution. Demonstrates that a 7B model trained with specialized RL significantly beats GPT-4o and Google on complex research tasks. The session-level PPO and synthetic data pipeline are highly reusable techniques.

⚙️ Technical Details

Problem Definition

Setting: Given a complex scholarly query q, retrieve a comprehensive set of relevant papers P via autonomous tool use.

Inputs: Natural language research query q

Outputs: A list of relevant papers (titles, abstracts, metadata)

Pipeline Flow

User Query → Crawler Agent
Crawler Loop: Search/Expand → Paper Queue
Selector Loop: Read Paper → Relevance Decision
Filtered Papers → Final Output
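
The flow above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the real Crawler is an LLM that chooses its own actions, whereas this sketch uses a plain breadth-first traversal. The callables `search`, `expand`, and `is_relevant`, the `max_papers` cap, and the toy citation graph are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Set


def run_pipeline(
    query: str,
    search: Callable[[str], List[str]],
    expand: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    max_papers: int = 100,
) -> List[str]:
    """Crawl papers into a queue, then keep those the selector accepts."""
    seen: Set[str] = set()
    queue: Deque[str] = deque()
    selected: List[str] = []

    def enqueue(papers: List[str]) -> None:
        # Add unseen papers to the queue until the cap is reached.
        for paper in papers:
            if paper not in seen and len(seen) < max_papers:
                seen.add(paper)
                queue.append(paper)

    enqueue(search(query))              # Crawler: initial search
    while queue:
        paper = queue.popleft()
        if is_relevant(query, paper):   # Selector: relevance decision
            selected.append(paper)
        enqueue(expand(paper))          # Crawler: citation expansion
    return selected


# Toy citation graph: paper -> papers it cites (illustrative data only).
graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["A"]}
relevant_set = {"A", "D"}

result = run_pipeline(
    query="toy query",
    search=lambda q: ["A"],
    expand=lambda p: graph.get(p, []),
    is_relevant=lambda q, p: p in relevant_set,
)
print(result)  # ['A', 'D']
```

Paper D is reachable only through the citations of B, which the selector rejects, so the sketch also shows why expansion is not restricted to accepted papers here.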

System Modules

Crawler

Maximize recall by finding potentially relevant papers via search or citations

Model or implementation: PaSa-7B (7B parameter LLM)

Selector

Maximize precision by reading papers to verify if they meet query requirements

Model or implementation: PaSa-7B (7B parameter LLM)

Novel Architectural Elements

Session-based trajectory segmentation for RL: splitting the potentially infinite search process into 'sessions' delimited by [Stop] actions to enable feasible PPO training
Dual-role Selector: Functions as both the final filter for the user AND the dense reward signal provider for the Crawler during training
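
A minimal sketch of the session segmentation, assuming a trajectory is represented as a flat list of action strings. Only the `[Stop]` delimiter comes from the description above; the `[Search]` and `[Expand]` labels and the function name `split_into_sessions` are illustrative.

```python
from typing import List

STOP = "[Stop]"


def split_into_sessions(actions: List[str]) -> List[List[str]]:
    """Split a long action trajectory into sessions ending at [Stop]."""
    sessions: List[List[str]] = []
    current: List[str] = []
    for action in actions:
        current.append(action)
        if action == STOP:
            # A [Stop] action closes the current session.
            sessions.append(current)
            current = []
    if current:
        # Keep a trailing, unfinished session rather than dropping it.
        sessions.append(current)
    return sessions


trajectory = [
    "[Search] non-stationary RL",
    "[Search] UCB value-based",
    STOP,
    "[Expand] Related Work",
    STOP,
    "[Expand] Methods",
]
sessions = split_into_sessions(trajectory)
print(len(sessions))  # 3
```

Each resulting session is short enough to be treated as its own training segment, which is the property that makes PPO updates feasible on long searches.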

Modeling

Base Model: 7B-parameter LLM, referred to as PaSa-7B (the underlying base model is not named in this summary)

Training Method: Reinforcement Learning (Session-level PPO) following Imitation Learning (SFT)

Objective Functions:

Purpose: Maximize expected reward from finding relevant papers.
Formally: Policy gradient loss with advantage estimation.

Purpose: Estimate value of states for advantage computation.
Formally: Value function loss (MSE between predicted value and returns).

Purpose: Prevent model from deviating too far from SFT initialization.
Formally: KL divergence penalty scaled by beta.
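
One common way to write these three terms is the standard PPO formulation sketched below. This is an assumption based on the usual clipped-surrogate form, not the paper's exact equations. The clipping range $\epsilon$, the advantage estimate $\hat{A}_t$, the return $\hat{R}_t$, and the choice to add the KL term to the loss (many setups fold it into the reward instead) are all illustrative; $\beta$ and $\eta$ stand for the KL and value coefficients.

```latex
\begin{align}
\mathcal{L}_{\text{policy}}(\theta)
  &= -\,\mathbb{E}_t\!\left[\min\!\Big(\rho_t(\theta)\,\hat{A}_t,\;
     \operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
  \qquad
  \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
\mathcal{L}_{\text{value}}(\phi)
  &= \mathbb{E}_t\!\left[\big(V_\phi(s_t) - \hat{R}_t\big)^2\right] \\
\mathcal{L}_{\text{KL}}(\theta)
  &= \beta\,\mathbb{E}_t\!\left[\log\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{SFT}}(a_t \mid s_t)}\right] \\
\mathcal{L}
  &= \mathcal{L}_{\text{policy}} + \eta\,\mathcal{L}_{\text{value}} + \mathcal{L}_{\text{KL}}
\end{align}
```

Under session-level training, the expectation over $t$ would run over the steps of one session rather than over the full search trajectory.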

Training Data:

AutoScholarQuery: 33,551 training pairs generated by GPT-4o from Related Work sections of ICLR, ICML, NeurIPS, ACL, CVPR papers
RealScholarQuery: 50 manually curated and annotated real-world queries for testing

Key Hyperparameters:

reward_coefficient_alpha: Parameter scaling the reward function
kl_penalty_beta: Parameter scaling the KL divergence penalty
value_coefficient_eta: Parameter scaling the value loss

Compute: Not reported in the paper

Comparison to Prior Work

vs. Google/Scholar: PaSa actively reads content and follows citations rather than just matching keywords
vs. ChatGPT: PaSa is specialized for academic structure (citations, related work) and trained via RL for this specific workflow
vs. PaSa-GPT-4o: PaSa-7B (trained) outperforms the prompt-based GPT-4o version, showing the value of the RL fine-tuning over just prompt engineering

Limitations

Computational cost of reading hundreds of papers during inference is likely high (though not explicitly quantified)
The remedy for the 'sparse reward' issue relies on the Selector being accurate; if the Selector misjudges relevance, the Crawler is rewarded for the wrong behavior
Real-world evaluation is limited to 50 queries due to the high cost of expert annotation ($304 per query)

Reproducibility

Code: https://github.com/bytedance/pasa

Code and models are stated to be available at https://github.com/bytedance/pasa. The synthetic dataset AutoScholarQuery and benchmark RealScholarQuery are also released.

📊 Experiments & Results

Evaluation Setup

Retrieval of relevant academic papers for complex queries

Benchmarks:

AutoScholarQuery, test set (synthetic academic retrieval) [New]
RealScholarQuery (real-world academic retrieval, human-annotated) [New]

Metrics:

Recall@20
Recall@50
Precision
Statistical methodology: Not explicitly reported in the paper
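
For reference, the metrics listed above can be computed as in the following sketch. It uses the standard definitions of Recall@K and precision; the function names and the toy paper IDs are illustrative and do not come from the paper's evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant papers that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved papers that are relevant."""
    unique = set(retrieved)
    if not unique:
        return 0.0
    return len(unique & relevant) / len(unique)


ranked = ["p1", "p7", "p3", "p9", "p2"]
relevant = {"p1", "p2", "p3", "p4"}

print(recall_at_k(ranked, relevant, k=3))  # 0.5
print(recall_at_k(ranked, relevant, k=5))  # 0.75
print(precision(ranked, relevant))         # 0.6
```

Recall is normalized by the number of relevant papers, so a system can only improve Recall@K by surfacing more of the ground-truth set, not by returning fewer results.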

Key Results

Performance on RealScholarQuery (real-world queries) showing PaSa-7B's dominance over search engines and larger models.
Benchmark	Metric	Baseline	This Paper	Δ
RealScholarQuery	Recall@20	See delta	See delta	+37.78%
RealScholarQuery	Recall@50	See delta	See delta	+39.90%
RealScholarQuery	Recall	See delta	See delta	+30.36%

Performance on AutoScholarQuery (synthetic test set) confirming the trends seen in real-world data.
Benchmark	Metric	Baseline	This Paper	Δ
AutoScholarQuery Test	Recall@20	See delta	See delta	+34.05%
AutoScholarQuery Test	Recall@50	See delta	See delta	+39.36%

Main Takeaways

Specialized RL training allows a smaller model (7B) to significantly outperform a much larger generalist model (GPT-4o) on agentic tasks.
The 'Crawler + Selector' architecture effectively balances recall (via citation mining) and precision (via reading), mimicking human researcher behavior.
Synthetic data derived from 'Related Work' sections is a viable and high-quality source for training complex information retrieval agents.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (MDP, Policy Gradient)
Language Models and Agentic workflows (Tool use)
Information Retrieval concepts (Recall, Precision)

Key Terms

PPO: Proximal Policy Optimization—an RL algorithm that improves agent stability by limiting how much the policy can change in one step

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

Recall@K: The proportion of relevant items retrieved within the top K results

SFT: Supervised Fine-Tuning—training the model on examples of correct behavior before applying reinforcement learning

AutoScholarQuery: A synthetic dataset created by the authors containing roughly 35k query-paper pairs (33,551 of them in the training split) derived from Related Work sections of top AI papers

Session-level PPO: A modification of PPO introduced here that updates the model based on shorter sub-trajectories (sessions) rather than waiting for the full, extremely long search process to finish