Accelerating Listwise Reranking: Reproducing and Enhancing FIRST

📝 Paper Summary

Listwise Reranking, Efficient Retrieval

FIRST accelerates listwise reranking by determining document order solely from the first generated token's logits, reducing inference latency by about 40% while maintaining effectiveness across diverse backbones and datasets.

Core Problem

Traditional listwise reranking using LLMs is slow because it requires auto-regressive generation of complete document identifier permutations.

Why it matters:

High inference latency makes deploying powerful LLM rerankers prohibitive for real-time search applications
Standard language modeling objectives uniformly penalize errors at all positions, failing to prioritize top-ranked documents crucial for retrieval effectiveness
Existing efficient methods often sacrifice ranking quality for speed

Concrete Example: A traditional listwise reranker generates a sequence like '2 > 1 > 4 > 3' token by token. Every identifier costs a separate decoding step, so latency grows with the length of the permutation. FIRST avoids this by reading only the logits of the very first generated token and inferring the full ranking from them.

Key Novelty

Single-Token Listwise Reranking via Logits

Instead of generating a text sequence of document IDs (e.g., 'A > B > C'), the model is trained so that the logits of the first output token, read over the candidate identifier tokens, serve as relevance scores for all candidate documents simultaneously
Combines a specific learning-to-rank loss (focusing on pairwise errors) with standard language modeling loss to align the model's first-token probability distribution with the ground truth ranking

Evaluation Highlights

Achieved ~40% reduction in inference latency compared to full-generation RankZephyr/RankMistral on TREC Deep Learning datasets
FirstMistral (FIRST on Mistral-v0.3) surpassed the original FIRST-Reddy implementation on 8 out of 11 BEIR datasets
Demonstrated robust out-of-domain generalization on TREC DL19–23, with FirstMistral (0.7209 nDCG@10) matching full-generation RankZephyr (0.7166 nDCG@10)

Breakthrough Assessment

7/10

Solid reproduction and extension of existing work (FIRST). Validates the efficiency gains and generalizes to new backbones, but also identifies tokenization issues and negative interference from prior fine-tuning with the language modeling objective.

⚙️ Technical Details

Problem Definition

Setting: Listwise document reranking using a sliding window approach

Inputs: Query q and a list of candidate documents R = {d1, ..., dn}

Outputs: Reordered list R based on relevance to q

Pipeline Flow

First-stage Retriever (fetches top-k documents)
Sliding Window Processor (chunks documents into windows)
LLM Reranker (processes window, outputs first-token logits)
Rank Aggregator (combines window scores into final list)
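The windowing step in the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common convention of moving the window from the bottom of the candidate list to the top and reordering each window in place, which the summary does not specify. The names `sliding_windows`, `rerank`, and `score_window` are hypothetical, and `score_window` stands in for the LLM reranker.

```python
def sliding_windows(n_docs, window_size=20, step_size=10):
    """Yield (start, end) slices, moving from the bottom of the list to the top.

    Assumption: back-to-front traversal; the summary does not state the direction.
    """
    if n_docs <= 0:
        return
    end = n_docs
    start = max(0, end - window_size)
    while True:
        yield start, end
        if start == 0:
            break
        end -= step_size
        start = max(0, end - window_size)


def rerank(docs, score_window, window_size=20, step_size=10):
    """Reorder docs window by window using per-window relevance scores.

    score_window(window) is a hypothetical callable returning one score per
    document in the window (higher means more relevant).
    """
    ranked = list(docs)
    for start, end in sliding_windows(len(ranked), window_size, step_size):
        window = ranked[start:end]
        scores = score_window(window)
        order = sorted(range(len(window)), key=lambda i: scores[i], reverse=True)
        ranked[start:end] = [window[i] for i in order]
    return ranked


if __name__ == "__main__":
    # Toy scorer: the document's own value is its relevance.
    toy = rerank(list(range(30)), lambda w: [float(d) for d in w])
    print(toy[:10])
```

With a window of 20 and a step of 10, each pass overlaps the previous one by 10 documents, so strong candidates near the bottom of the list can move up one window at a time.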

System Modules

First-stage Retriever

Fetch initial candidate documents

Model or implementation: Contriever, BM25, SPLADE++, or RepLLaMA

FIRST Reranker

Score documents within a window using single-token logits

Model or implementation: Fine-tuned Zephyr beta, Mistral-v0.3, or LLaMA-3.1-8B

Novel Architectural Elements

Inference mechanism relies strictly on the logits of the first generated token (single-token decoding) rather than sequential text generation
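A minimal sketch of how a ranking could be read from first-token logits, assuming the logits for the identifier tokens have already been extracted into a dictionary. The function name and the dictionary layout are hypothetical, and the sketch uses uppercase letters as the alphabetical identifiers.

```python
import string


def rank_from_first_token_logits(identifier_logits, n_docs):
    """Order candidate identifiers by their first-token logit, highest first.

    identifier_logits: hypothetical mapping such as {"A": 0.1, "B": 2.3},
    holding the logit assigned to each identifier token at the first
    decoding position. Missing identifiers are ranked last.
    """
    if n_docs > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 candidates per window")
    identifiers = string.ascii_uppercase[:n_docs]
    scored = [
        (identifier_logits.get(ident, float("-inf")), pos)
        for pos, ident in enumerate(identifiers)
    ]
    # Sort by descending logit; ties keep the original candidate order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [identifiers[pos] for _, pos in scored]


if __name__ == "__main__":
    print(rank_from_first_token_logits({"A": 0.1, "B": 2.3, "C": -1.0, "D": 1.5}, 4))
```

Because only one forward pass is needed to obtain these logits, no further tokens have to be decoded to recover the full order of the window.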

Modeling

Base Model: Multiple backbones evaluated (Zephyr beta, Mistral-7B-Instruct-v0.3, LLaMA-3.1-8B-Instruct)

Training Method: Fine-tuning with joint loss objective

Objective Functions:

Purpose: Ensure the model generates valid identifier tokens.

Formally: Standard language modeling loss (L_LM)

Purpose: Prioritize correct ordering of top documents.

Formally: Weighted pairwise learning-to-rank loss, L_Rank = sum over pairs (i, j) with r_i < r_j of 1/(i+j) * log(1 + exp(p_i - p_j))
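The two objectives can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' training code: it assumes i and j in the weight 1/(i+j) are the 1-based ground-truth ranks of the two documents, and that the joint objective has the form L_LM + lambda * L_Rank with lambda = 10 as listed in the hyperparameters. The sign inside the exponential is copied from the formula as the summary states it; for scores where higher means more relevant, the usual RankNet form is log(1 + exp(p_j - p_i)), so the convention should be checked against the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rank_loss(p, ranks):
    """Weighted pairwise loss, following the summary's formula literally.

    p[k]     : first-token value for candidate k (sign convention as in the summary)
    ranks[k] : 1-based ground-truth rank of candidate k (1 = most relevant)
    Assumption: the weight 1/(i+j) uses these ranks, so pairs near the top
    of the ranking contribute more than pairs near the bottom.
    """
    total = 0.0
    for a in range(len(p)):
        for b in range(len(p)):
            if ranks[a] < ranks[b]:
                weight = 1.0 / (ranks[a] + ranks[b])
                total += weight * softplus(p[a] - p[b])
    return total


def joint_loss(lm_loss, ranking_loss, lam=10.0):
    """Assumed joint objective: L_LM + lambda * L_Rank."""
    return lm_loss + lam * ranking_loss


if __name__ == "__main__":
    l_rank = rank_loss([0.0, 0.0, 0.0, 0.0], [1, 2, 3, 4])
    print(l_rank, joint_loss(1.0, l_rank))
```

The pair of ranks (1, 2) receives weight 1/3 while the pair (3, 4) receives 1/7, which is how the loss emphasizes ordering errors among the top documents.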

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (7B/8B scale)

Training Data:

40K GPT-4 labeled rerank instances (from RankZephyr)
Converted to use alphabetical identifiers

Key Hyperparameters:

lambda: 10 (weight for ranking loss)
effective_batch_size: 32
learning_rate: 5e-6
epochs: 3
window_size: 20
step_size: 10

Compute: Training on 4 NVIDIA RTX A6000s; inference tested on a single NVIDIA RTX 4090.

Comparison to Prior Work

vs. RankZephyr: Uses single-token logits instead of full permutation generation; includes explicit ranking loss
vs. RankGPT: Fine-tuned local model (7B) vs. API-based zero-shot model
vs. Setwise Reranking [not cited in paper]: FIRST maintains listwise context but uses single-token decoding, whereas Setwise approaches typically aggregate pairwise/pointwise scores or use heap-sort-like mechanisms

Limitations

Alphabetical identifiers can suffer from tokenization inconsistencies (e.g., 'A' vs ' A'), requiring post-processing
Prior fine-tuning with the language modeling objective (L_LM) may hinder subsequent fine-tuning on the FIRST objective (FirstRankZephyr underperforms FirstZephyr)
Effectiveness varies significantly by backbone (FirstLLaMA performed poorly compared to Mistral variants)
Diminishing returns when paired with strong first-stage retrievers
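The tokenization limitation above can be handled in several ways; one possible post-processing step is sketched here. It is an assumption, not the filtering used in the paper: generated tokens are stripped of surrounding whitespace, and anything that is not a valid identifier for the current window, or that repeats an earlier identifier, is dropped. The function name is hypothetical.

```python
import string


def normalize_identifiers(generated_tokens, n_docs):
    """Clean a list of generated identifier tokens.

    Treats 'A' and ' A' as the same identifier, discards separators and
    out-of-range letters, and keeps only the first occurrence of each one.
    """
    valid = set(string.ascii_uppercase[:n_docs])
    seen = set()
    cleaned = []
    for token in generated_tokens:
        ident = token.strip()
        if ident in valid and ident not in seen:
            seen.add(ident)
            cleaned.append(ident)
    return cleaned


if __name__ == "__main__":
    print(normalize_identifiers(["B", " A", ">", "A", " C", "Z"], 3))
```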

Reproducibility

Code: https://rankllm.ai

The training data (40K GPT-4 labeled instances) and the original FIRST checkpoint (on HuggingFace) are available. Known tokenization issue: alphabetical identifiers are sometimes generated with preceding whitespace and must be filtered in post-processing.

📊 Experiments & Results

Evaluation Setup

Reranking top-100 documents retrieved by first-stage retriever

Benchmarks:

BEIR: diverse retrieval tasks (Climate-FEVER, DBPedia, etc.)
TREC Deep Learning (DL19-23): passage ranking

Metrics:

nDCG@10
Latency (s/query)
Statistical methodology: Paired Student's t-test (p <= 0.01) with Bonferroni correction
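For readers unfamiliar with the effectiveness metric listed above, a minimal sketch of nDCG@10 follows. It assumes linear gain (the relevance grade itself) and assumes the input list contains the grades of all judged documents for the query, so the ideal ordering can be derived from it; evaluation toolkits may differ on both points.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given order divided by DCG of the ideal order.

    ranked_relevances: relevance grades in the order the system ranked them.
    Simplifying assumption: the list covers every judged document.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
    print(ndcg_at_k([0, 1]))        # relevant document placed second
```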

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Generalization across diverse datasets (BEIR/MS MARCO): FirstMistral (This Paper) often outperforms the original FIRST implementation (Baseline).
FiQA (BEIR)	nDCG@10	0.4223	0.4778	+0.0555
TREC-COVID (BEIR)	nDCG@10	0.7913	0.7666	-0.0247
Evaluation on out-of-domain TREC DL datasets: FirstMistral (This Paper) is competitive with full-generation RankZephyr (Baseline).
TREC DL19-23 (Average)	nDCG@10	0.7166	0.7209	+0.0043
Latency experiments: single-token decoding (This Paper) versus full permutation generation (Baseline).
TREC DL20	Latency (s/query)	3.06	1.79	-1.27
TREC DL20	Output Tokens	711	18	-693
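The relative savings implied by the TREC DL20 latency rows can be checked with a few lines of arithmetic. The helper name is hypothetical; the numbers are taken directly from the table.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


# TREC DL20 rows from the results table.
latency_cut = relative_reduction(3.06, 1.79)   # seconds per query
token_cut = relative_reduction(711, 18)        # output tokens

if __name__ == "__main__":
    print(f"latency reduction: {latency_cut:.1%}")
    print(f"output token reduction: {token_cut:.1%}")
```

The latency figure comes out at about 41.5%, consistent with the roughly 40% reduction reported, while output tokens drop by about 97.5%. The gap between the two suggests that a large share of per-query time is spent outside of decoding, for example on processing the long input prompt.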

Experiment Figures

Training loss curves for FirstMistral vs FirstLLaMA

Main Takeaways

FIRST generalizes well to Mistral-v0.3 (FirstMistral), often outperforming the original Zephyr-based implementation
Models trained solely on L_LM (RankZephyr/RankVicuna) show strong zero-shot ability to perform single-token reranking, suggesting that training with L_LM alone already captures ranking signals in the first-token logits
Counter-intuitively, fine-tuning an already ranking-tuned model (RankZephyr) on the FIRST objective performs worse than fine-tuning from the base model (Zephyr), suggesting interference between objectives
Efficiency gains (~40% latency reduction) are consistent across datasets and backbones
Retriever quality impacts FIRST similarly to traditional rerankers: better retrieval helps, but with diminishing returns

📚 Prerequisite Knowledge

Prerequisites

Knowledge of multi-stage retrieval (retriever + reranker)
Understanding of LLM generation (logits, tokens)
Familiarity with ranking metrics (nDCG)

Key Terms

Listwise Reranking: A ranking approach where the model considers multiple documents simultaneously to produce an ordered list, rather than scoring each document independently

Logits: The raw, unnormalized prediction scores generated by the final layer of a neural network before applying softmax

nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a measure of ranking quality that rewards placing relevant items near the top and is computed over the top 10 results

Zephyr beta: A specific instruction-tuned version of the Mistral-7B language model

BM25: A probabilistic retrieval function based on term frequency and inverse document frequency

SPLADE++: A sparse neural retrieval model that learns sparse term weights for documents and queries

RepLLaMA: A dense retrieval model based on the LLaMA architecture

Sliding Window: A technique to handle long lists by processing a fixed-size subset of documents at a time and moving the window with a specific step size