OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation

📝 Paper Summary

Reviewer Recommendation, Academic Graph Mining

OmniReview integrates multi-source academic data to create a verified peer-review benchmark and proposes a multi-task framework (Pro-MMoE) to balance reviewer recall, expertise discrimination, and ranking.

Core Problem

Existing reviewer recommendation datasets lack comprehensive scholar profiles, suffer from biased candidate labels (artificial annotation or restricted pools), and use simplistic metrics that fail to filter unqualified candidates.

Why it matters:

Editorial workflows struggle to match growing submissions with qualified experts due to fragmented data
Standard retrieval metrics suffer from false-negative bias: they penalize a model for recommending qualified experts who simply were never assigned to the paper
Existing systems fail to distinguish between true experts and 'hard negatives' who share keywords but lack deep domain expertise

Concrete Example: A researcher with strong credentials in a broad field (e.g., 'Machine Learning') might be recommended for a specific sub-field paper (e.g., 'Molecular Dynamics') due to keyword overlap, despite having zero publications in that specific niche.
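The failure mode in this example can be reproduced with a toy keyword matcher. The sketch below is illustrative only: the keyword sets and candidate names are invented, and Jaccard overlap stands in for whatever keyword matching a real system might use.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two keyword sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical keyword sets, invented for illustration.
paper = {"machine learning", "neural network", "molecular dynamics", "force field"}
candidates = {
    # Broad ML researcher with no work in the target niche.
    "ml_generalist": {"machine learning", "neural network", "optimization"},
    # Researcher who actually publishes in the target niche.
    "md_specialist": {"molecular dynamics", "force field", "protein folding"},
}

scores = {name: jaccard(paper, kws) for name, kws in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Both candidates share two of five distinct keywords with the paper, so the matcher scores them identically and cannot tell the generalist from the specialist.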

Key Novelty

OmniReview Benchmark & Pro-MMoE Framework

Constructs a massive dataset (202k+ reviews) by aligning OAG, Frontiers, and ORCID data via a multi-step entity disambiguation pipeline
Defines a three-tier evaluation hierarchy: Recall (finding experts), Discrimination (rejecting superficial matches), and Ranking (ordering valid candidates)
Proposes Pro-MMoE, which combines LLM-generated semantic profiles (for interpretability) with a Task-Adaptive Mixture-of-Experts architecture to balance conflicting evaluation goals

Architecture

Figure: limitations of existing datasets (Insufficient Profiling, Biased Labels, Simplistic Metrics) versus the OmniReview approach.

Evaluation Highlights

Dataset comprises 202,756 verified review records and 150,287 identified reviewers
Pro-MMoE improves Task 3 (Ranking) by +17.15% over state-of-the-art baselines [Note: relative gain; absolute values not in text]
Pro-MMoE improves Task 2 (Discrimination) by +5.39% over state-of-the-art baselines [Note: relative gain; absolute values not in text]

Breakthrough Assessment

8/10

Significant contribution to infrastructure (large-scale verified dataset) and methodology (addressing the hard-negative/discrimination problem often ignored in retrieval tasks).

⚙️ Technical Details

Problem Definition

Setting: Given a paper P and a set of scholars A, identify the subset of scholars R who are qualified to review P.

Inputs: Paper metadata (title, abstract), Scholar profiles (publication history, co-authors)

Outputs: Ordered list of recommended reviewers with confidence scores
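The setting, inputs, and outputs above can be written down as a small interface. This is a minimal sketch under assumptions: the class names, the `recommend` helper, and the placeholder `toy_score` function are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Paper:
    title: str
    abstract: str


@dataclass
class Scholar:
    name: str
    publications: list[str] = field(default_factory=list)  # publication titles
    coauthors: list[str] = field(default_factory=list)


def recommend(
    paper: Paper,
    scholars: list[Scholar],
    score: Callable[[Paper, Scholar], float],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top-k (scholar name, confidence) pairs, highest score first."""
    scored = [(s.name, score(paper, s)) for s in scholars]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_score(paper: Paper, scholar: Scholar) -> float:
    """Placeholder scorer: share of title words found in the scholar's publication titles."""
    words = set(paper.title.lower().split())
    seen = set(" ".join(scholar.publications).lower().split())
    return len(words & seen) / len(words) if words else 0.0


paper = Paper(title="graph neural networks", abstract="")
pool = [
    Scholar("A", publications=["neural networks for graphs"]),
    Scholar("B", publications=["protein folding"]),
]
top = recommend(paper, pool, toy_score, k=1)
```

Any real scorer, including a learned model, would plug in through the `score` argument; the placeholder exists only to make the sketch runnable.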

Pipeline Flow

Entity Alignment (Data Cleaning -> Publication Matching -> Scholar Matching -> Verification)
Taxonomy Construction (Hierarchical clustering via Qwen3 embeddings)
Pro-MMoE Inference (LLM Profiling -> Task-Adaptive MMoE -> Multi-objective Output)

System Modules

Entity Alignment Pipeline

Integrate fragmented data from OAG, Frontiers, and ORCID into verified profiles

Model or implementation: Algorithm-based (Word-level title matching, Co-author intersection)
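The two heuristics named here, word-level title matching and co-author intersection, could look roughly like the following. This is a sketch under assumptions, not the paper's algorithm: the tokenization rule, the 0.9 overlap threshold, and the `min_shared` parameter are hypothetical choices.

```python
from __future__ import annotations

import re


def title_tokens(title: str) -> list[str]:
    """Lowercase a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Word-level match: Jaccard overlap of token sets meets the threshold."""
    ta, tb = set(title_tokens(a)), set(title_tokens(b))
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def same_scholar(
    name_a: str,
    coauthors_a: set[str],
    name_b: str,
    coauthors_b: set[str],
    min_shared: int = 1,
) -> bool:
    """Treat two same-named records as one person if they share enough co-authors."""
    if name_a.strip().lower() != name_b.strip().lower():
        return False
    shared = {c.lower() for c in coauthors_a} & {c.lower() for c in coauthors_b}
    return len(shared) >= min_shared


# Punctuation and case differences do not block a title match.
print(titles_match("OmniReview: A Large-scale Benchmark", "omnireview a large scale benchmark"))
# Same name, one shared co-author: linked.
print(same_scholar("Wei Zhang", {"A. Li", "B. Chen"}, "wei zhang", {"b. chen"}))
```

Requiring a shared co-author is what guards against merging two different researchers who happen to have the same name.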

LLM Profiler (Pro-MMoE Inference)

Generate semantic profiles to preserve fine-grained expertise nuances

Model or implementation: LLM (Specific model for profiling not detailed in text)

Task-Adaptive MMoE (Pro-MMoE Inference)

Dynamically balance conflicting goals (Recall vs. Discrimination)

Model or implementation: Multi-gate Mixture-of-Experts

Novel Architectural Elements

Task-Adaptive MMoE architecture specifically designed to balance the trade-off between retrieving broad candidates (Recall) and filtering for absolute expertise (Discrimination) within a unified framework
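For readers unfamiliar with the pattern, a generic multi-gate mixture-of-experts forward pass is sketched below in plain Python. The source does not describe the Task-Adaptive details of Pro-MMoE, so this shows only the standard MMoE idea (shared experts, one softmax gate and one output tower per task). The class name, layer sizes, and random untrained weights are assumptions.

```python
from __future__ import annotations

import math
import random


def linear(x: list[float], w: list[list[float]]) -> list[float]:
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def relu(v: list[float]) -> list[float]:
    return [max(0.0, a) for a in v]


def softmax(v: list[float]) -> list[float]:
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng: random.Random, rows: int, cols: int) -> list[list[float]]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMMoE:
    """Shared experts plus one gate and one scalar tower per task (untrained)."""

    def __init__(self, d_in: int, d_hidden: int, n_experts: int, tasks: list[str], seed: int = 0):
        rng = random.Random(seed)
        self.experts = [rand_matrix(rng, d_in, d_hidden) for _ in range(n_experts)]
        self.gates = {t: rand_matrix(rng, d_in, n_experts) for t in tasks}
        self.towers = {t: rand_matrix(rng, d_hidden, 1) for t in tasks}

    def forward(self, x: list[float]):
        # Every expert sees the same input; experts are shared across tasks.
        expert_out = [relu(linear(x, w)) for w in self.experts]
        outputs, gate_weights = {}, {}
        for task, gate in self.gates.items():
            # Each task has its own gate, so it can weight the experts differently.
            g = softmax(linear(x, gate))
            mixed = [
                sum(g[e] * expert_out[e][h] for e in range(len(expert_out)))
                for h in range(len(expert_out[0]))
            ]
            logit = linear(mixed, self.towers[task])[0]
            outputs[task] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid score in (0, 1)
            gate_weights[task] = g
        return outputs, gate_weights


model = TinyMMoE(d_in=4, d_hidden=3, n_experts=2, tasks=["recall", "discrimination", "ranking"])
outputs, gate_weights = model.forward([0.1, 0.2, 0.3, 0.4])
```

The per-task gates are what let conflicting objectives, such as broad recall and strict discrimination, draw on the same experts without being forced to the same mixture.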

Modeling

Base Model: Qwen3-Embedding-4B (used for Taxonomy construction)
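As a rough picture of how embeddings can be grouped into a taxonomy level, the sketch below runs greedy average-linkage agglomerative clustering on cosine similarity. It is an assumption-laden illustration: the 2-dimensional toy vectors stand in for Qwen3-Embedding-4B outputs, and the linkage rule and stopping criterion are not taken from the paper.

```python
from __future__ import annotations

import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def agglomerate(vectors: list[list[float]], n_clusters: int) -> list[list[int]]:
    """Repeatedly merge the two most similar clusters until n_clusters remain."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a: list[int], b: list[int]) -> float:
        # Average-linkage: mean pairwise cosine similarity between two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters


# Toy "embeddings": items 0 and 1 point one way, items 2 and 3 another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = agglomerate(toy_vectors, n_clusters=2)
```

Running the same procedure again on cluster centroids would produce coarser parent nodes, which is one way to obtain a multi-level hierarchy.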

Training Method: Multi-task Learning (MMoE)

Training Data:

202,756 verified review records
150,287 reviewers
Hierarchical discipline taxonomy used to generate dense relevance labels and hard negatives
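One plausible way a hierarchical taxonomy yields dense relevance labels and hard negatives is sketched below. The labeling rule (shared path prefix length) and the hard-negative rule (same parent, different leaf) are assumptions made for illustration, as are the taxonomy paths and scholar names; the paper's exact rules are not given in the source.

```python
from __future__ import annotations

Path = tuple[str, ...]


def shared_depth(path_a: Path, path_b: Path) -> int:
    """Number of leading taxonomy levels two paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def relevance(paper_path: Path, scholar_path: Path) -> float:
    """Graded label in [0, 1]: shared prefix length over the paper's path length."""
    if not paper_path:
        return 0.0
    return shared_depth(paper_path, scholar_path) / len(paper_path)


def hard_negatives(paper_path: Path, scholars: dict[str, Path]) -> list[str]:
    """Scholars who match the paper down to its parent node but not its leaf."""
    parent_depth = len(paper_path) - 1
    if parent_depth < 1:
        return []  # a single-level path has no parent to share
    return [
        name
        for name, path in scholars.items()
        if shared_depth(paper_path, path) == parent_depth
    ]


# Hypothetical taxonomy paths, invented for illustration.
paper_path = ("Computer Science", "Machine Learning", "Graph Learning")
scholars = {
    "a": ("Computer Science", "Machine Learning", "Graph Learning"),
    "b": ("Computer Science", "Machine Learning", "Reinforcement Learning"),
    "c": ("Physics", "Condensed Matter", "Superconductivity"),
}
labels = {name: relevance(paper_path, path) for name, path in scholars.items()}
negatives = hard_negatives(paper_path, scholars)
```

Every scholar receives a graded label even without a historical assignment, which is how a taxonomy can densify sparse ground truth.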

Compute: Graph construction processes 109M publications and 34M profiles, which the paper notes as a scale challenge

Comparison to Prior Work

vs. FRONTIER-RevRec: OmniReview integrates external graphs (OAG/ORCID) for full scholar context, not just local platform data
vs. Embedding-based methods: Pro-MMoE uses LLM profiles to avoid 'semantic compression' and 'oversmoothing' of fine-grained sub-fields
vs. Standard Retrieval: Introduces 'Discrimination' task to specifically penalize hard-negatives (experts in adjacent fields but not the target field)

Limitations

Dataset construction relies on heuristic matching algorithms which may still contain edge-case errors
Evaluation requires complex hierarchical taxonomy construction which is computationally intensive
Full experimental results (absolute numbers) and Pro-MMoE architectural details are not fully visible in the provided text snippet

Reproducibility

Code: https://sites.google.com/view/omnireview-dataset

Project page available at https://sites.google.com/view/omnireview-dataset. Dataset construction algorithms (Title Matching, Scholar Matching) are explicitly detailed in the paper. Specific hyperparameters for Pro-MMoE training are not in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Reviewer Recommendation across three distinct tasks

Benchmarks:

OmniReview (Reviewer Recommendation) [New]

Metrics:

Task 1: Recall (Retrieving historical ground-truth)
Task 2: Discrimination (Filtering hard-negatives)
Task 3: Ranking (Fine-grained ranking)
Statistical methodology: Not explicitly reported in the paper
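The source names the three tasks but not the exact metrics behind them, so the following is only a sketch of one common metric per task: Recall@k for Task 1, pairwise accuracy against hard negatives for Task 2, and NDCG@k for Task 3. The function names, scores, and gain values are hypothetical.

```python
from __future__ import annotations

import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Task 1 style: share of ground-truth reviewers found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def pairwise_discrimination(scores: dict[str, float], positives: set[str], negatives: set[str]) -> float:
    """Task 2 style: share of (positive, hard-negative) pairs ordered correctly."""
    pairs = [(p, n) for p in positives for n in negatives]
    if not pairs:
        return 0.0
    return sum(scores[p] > scores[n] for p, n in pairs) / len(pairs)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """Task 3 style: discounted gain of the ranking relative to the ideal ordering."""
    dcg = sum(gains.get(item, 0.0) / math.log2(rank + 2) for rank, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical model scores: "a" and "b" are true reviewers, "c" is a hard negative.
scores = {"a": 0.9, "b": 0.7, "c": 0.8, "d": 0.1}
ranked = sorted(scores, key=scores.get, reverse=True)  # ["a", "c", "b", "d"]
gains = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 0.0}

recall = recall_at_k(ranked, {"a", "b"}, k=2)
discrimination = pairwise_discrimination(scores, {"a", "b"}, {"c"})
ndcg = ndcg_at_k(ranked, gains, k=3)
```

In this toy case the hard negative "c" outranks the true reviewer "b", which lowers all three numbers; a recall-only evaluation would not reveal that a hard negative was responsible.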

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
OmniReview	Task 2 (Discrimination)	not in text	not in text	+5.39%
OmniReview	Task 3 (Ranking)	not in text	not in text	+17.15%

The provided text contains the Introduction and Dataset Construction sections but cuts off before the Experiments section, so absolute values for baselines and results are unavailable. The relative improvements listed here are taken from the Introduction.

Main Takeaways

Pro-MMoE achieves state-of-the-art performance on 6 of 7 metrics on the OmniReview benchmark.
The method improves Task 3 (Ranking) by a large margin (+17.15%), suggesting LLM profiles help distinguish fine-grained expertise better than static embeddings.
The method improves Task 2 (Discrimination) by +5.39%, indicating better handling of hard-negatives (candidates with superficial keyword matches but no deep expertise).
Hierarchical taxonomy allows for the generation of 'dense relevance labels', overcoming the sparsity of historical assignment data.

📚 Prerequisite Knowledge

Prerequisites

Knowledge of academic graphs (OAG, ORCID)
Understanding of Recommender Systems (Recall vs. Precision)
Familiarity with Multi-task Learning architectures

Key Terms

MMoE: Multi-gate Mixture-of-Experts, a neural architecture in which several shared expert networks feed task-specific gating mechanisms, so that shared and task-specific information are learned simultaneously

Entity Alignment: The process of identifying and linking records that refer to the same real-world entity (e.g., researcher) across different databases

OAG: Open Academic Graph, a large-scale academic graph linking hundreds of millions of academic entities (papers, authors, venues)

Hard-negatives: Candidates that appear relevant superficially (e.g., share keywords/subject) but are actually incorrect (lack specific expertise), making them difficult for models to filter

Qwen3-Embedding-4B: A pre-trained text embedding model (4B parameters) from the Qwen3 family, used in this paper to generate semantic vector representations of text