DemoRank: Selecting Effective Demonstrations for Large Language Models in Ranking Task

📝 Paper Summary

In-Context Learning (ICL) Passage Ranking

DemoRank improves few-shot passage ranking by retrieving candidate demonstrations and then reranking them using a dependency-aware reranker trained with a novel list-pairwise approach.

Core Problem

Existing demonstration selection methods retrieve examples independently based on relevance, ignoring the critical dependencies between demonstrations in the prompt sequence.

Why it matters:

In passage ranking, combining diverse demonstrations (e.g., opposite labels, distinct queries) often helps the LLM understand relevance better than just stacking high-confidence examples.
Choosing demonstrations independently leads to redundancy and suboptimal k-shot prompts.
Identifying the optimal permutation of k demonstrations is an NP-hard problem, making it difficult to generate training data for selection models.
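
To make the size of the search space concrete, the following minimal sketch counts how many k-shot prompts an exhaustive search would have to score, compared with a search that fixes one slot at a time. The pool size of 100 and k = 3 are illustrative assumptions, not values taken from the paper.

```python
import math


def exhaustive_evaluations(n_candidates: int, k: int) -> int:
    """Number of ordered k-demonstration prompts drawn without repetition."""
    return math.perm(n_candidates, k)


def iterative_evaluations(n_candidates: int, k: int) -> int:
    """Prompts scored when one slot is fixed per step (greedy search)."""
    return sum(n_candidates - step for step in range(k))


# Hypothetical candidate-set size and shot count, chosen for illustration.
n, k = 100, 3
print(exhaustive_evaluations(n, k))  # 970200
print(iterative_evaluations(n, k))   # 297
```

Each evaluation corresponds to at least one LLM call, which is why an approximation is needed to build training data at scale.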

Concrete Example: When ranking a relevant query-passage pair, standard retrievers might select two positive demonstrations with similar queries. However, pairing one positive example with a negative example (having opposite outputs) provides richer signals about the decision boundary, which independent selection misses.

Key Novelty

Dependency-Aware Demonstration Reranking (DemoRank)

Transforms demonstration selection into a 'retrieve-then-rerank' pipeline: first retrieve candidates, then iteratively rerank them to build a sequence.
Approximates the optimal demonstration sequence via an iterative search (scoring lists that differ only in the last item) to create training data efficiently.
Uses a list-pairwise training objective where the model learns to choose the best 'next' demonstration given the context of previously selected ones.
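
The iterative search described above can be sketched roughly as follows. This is a simplified illustration, not the paper's Algorithm 1: `build_training_lists`, the `Demo` tuple layout, and `toy_llm_feedback` (a stand-in for real LLM feedback that rewards label diversity) are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical demonstration record: (id, label, standalone quality).
Demo = Tuple[str, str, float]


def build_training_lists(
    candidates: List[Demo],
    k: int,
    score_fn: Callable[[List[Demo]], float],
) -> Tuple[List[Demo], List[Tuple[List[Demo], List[Demo]]]]:
    """Greedily grow a sequence, recording a ranked candidate list per step."""
    selected: List[Demo] = []
    remaining = list(candidates)
    samples = []
    for _ in range(min(k, len(remaining))):
        # All scored lists share the prefix `selected`; only the last item differs.
        scored = sorted(
            ((score_fn(selected + [c]), c) for c in remaining),
            key=lambda pair: pair[0],
            reverse=True,
        )
        samples.append((list(selected), [c for _, c in scored]))
        best = scored[0][1]
        selected.append(best)
        remaining.remove(best)
    return selected, samples


def toy_llm_feedback(seq: List[Demo]) -> float:
    """Stand-in scorer (assumption): mean quality plus a label-diversity bonus."""
    mean_quality = sum(d[2] for d in seq) / len(seq)
    diversity_bonus = 0.5 if len({d[1] for d in seq}) > 1 else 0.0
    return mean_quality + diversity_bonus


pool = [("d1", "yes", 0.9), ("d2", "yes", 0.8), ("d3", "no", 0.6)]
selected, samples = build_training_lists(pool, k=2, score_fn=toy_llm_feedback)
print([d[0] for d in selected])  # ['d1', 'd3']
```

In this toy run the second slot goes to the weaker but oppositely labeled d3, which mirrors the dependency effect that independent top-k selection cannot capture. Each recorded (prefix, ranked candidates) pair is the kind of sample a list-pairwise objective can train on.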

Architecture

The DemoRank framework pipeline, illustrating the two-stage process: Demonstration Retrieval followed by Dependency-Aware Reranking.

Evaluation Highlights

Outperforms state-of-the-art baselines like EPR and UDR on MS MARCO and TREC benchmarks, achieving 75.33 NDCG@10 on MS MARCO (dev).
Achieves significant gains in few-shot settings; for example, on TREC-DL 2019, DemoRank reaches 74.00 NDCG@10 compared to 71.85 for UDR.
Demonstrates strong transferability across different LLM rankers (e.g., testing on LLaMA-2-13B with a reranker trained for LLaMA-2-7B).

Breakthrough Assessment

7/10

Novel formulation of demonstration selection as a sequential reranking problem with a clever approximation for generating training data. Strong empirical results on standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Few-shot passage ranking using Large Language Models (LLMs) via In-Context Learning (ICL).

Inputs: A query q, a candidate passage p, and a pool of potential demonstrations P.

Outputs: A sequence of k demonstrations [z_1, ..., z_k] prepended to the input to maximize the LLM's ranking accuracy.
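
As a rough illustration of how the selected sequence is used, the sketch below prepends k demonstrations to a test query-passage pair. The prompt template, the `Demonstration` tuple layout, and the function names are assumptions made for this example; the paper's exact prompt wording is not given in this summary.

```python
from typing import List, Tuple

# Hypothetical layout: (query, passage, label).
Demonstration = Tuple[str, str, str]


def format_example(query: str, passage: str, label: str = "") -> str:
    """Render one example; an empty label leaves the answer slot open."""
    return f"Passage: {passage}\nQuery: {query}\nRelevant: {label}".rstrip()


def build_prompt(demos: List[Demonstration], query: str, passage: str) -> str:
    """Prepend the demonstrations, in order, to the unlabeled test pair."""
    blocks = [format_example(q, p, y) for q, p, y in demos]
    blocks.append(format_example(query, passage))
    return "\n\n".join(blocks)


demos = [
    ("what is ndcg", "NDCG is a ranking metric.", "Yes"),
    ("what is ndcg", "Paris is the capital of France.", "No"),
]
prompt = build_prompt(demos, "what is icl", "In-context learning uses examples.")
print(prompt)
```

Because the demonstrations are concatenated in order, both which examples are chosen and where they sit in the sequence change the prompt the LLM sees.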

Pipeline Flow

Demonstration Retrieval (DRetriever)
Demonstration Reranking (DReranker)

System Modules

DRetriever

Retrieve a candidate set of potentially useful demonstrations from the global pool.

Model or implementation: BERT-base-uncased (Bi-encoder)
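
A minimal sketch of bi-encoder retrieval, assuming the embeddings have already been computed by the encoder: candidates are scored by dot product with the test input's vector and the top b are kept. The toy two-dimensional vectors and the `retrieve_top_b` helper are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_b(
    query_vec: Sequence[float],
    demo_vecs: List[Sequence[float]],
    b: int,
) -> List[int]:
    """Return indices of the b highest-scoring demonstrations."""
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(demo_vecs)]
    # Sort by descending score; break ties by original index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:b]]


# Toy embeddings standing in for encoder outputs (assumption).
query_vec = [1.0, 0.0]
demo_vecs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
print(retrieve_top_b(query_vec, demo_vecs, b=2))  # [0, 2]
```

Since each demonstration is scored independently of the others, this stage alone cannot model dependencies between demonstrations, which is the gap the reranker addresses.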

DReranker

Iteratively select the next demonstration from the candidate set given the previously selected sequence.

Model or implementation: BERT-base-uncased (Cross-encoder)

Novel Architectural Elements

Iterative selection pipeline in which the reranker takes as input the concatenation of the test query-passage pair and the *entire* currently selected demonstration sequence, then scores candidates for the next slot.
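
The inference loop can be sketched as below, under stated assumptions: the `[SEP]`-joined input layout, the helper names, and the stub scorer `toy_score` are hypothetical stand-ins for the trained cross-encoder and its real input format.

```python
from typing import Callable, List


def build_reranker_input(
    query: str, passage: str, selected: List[str], candidate: str
) -> str:
    """Concatenate test pair, selected demonstrations, and one candidate."""
    return " [SEP] ".join([f"{query} {passage}"] + selected + [candidate])


def rerank(
    query: str,
    passage: str,
    candidates: List[str],
    k: int,
    score_fn: Callable[[str], float],
) -> List[str]:
    """Fill k slots one at a time, conditioning on what is already selected."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: score_fn(build_reranker_input(query, passage, selected, c)),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score(text: str) -> float:
    """Stub scorer (assumption): counts distinct label tokens in the input."""
    return float(len(set(text.split()) & {"label:yes", "label:no"}))


cands = ["demoA label:yes", "demoB label:yes", "demoC label:no"]
print(rerank("q", "p", cands, k=2, score_fn=toy_score))
# ['demoA label:yes', 'demoC label:no']
```

Each step requires one scoring pass per remaining candidate, so the number of reranker calls grows with k.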

Modeling

Base Model: BERT-base-uncased for both DRetriever and DReranker

Training Method: Supervised learning with constructed dependency-aware samples

Objective Functions:

Purpose: Train the retriever to identify high-quality individual demonstrations.

Formally: Weighted sum of Contrastive Loss (L_c) and Ranking Loss (RankNet, L_r).

Purpose: Train the reranker to select the next demonstration given a context.

Formally: List-pairwise RankNet loss comparing lists l1 and l2 differing only in the last element.
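
A hedged sketch of the two objectives follows. The RankNet term is the standard pairwise logistic loss; averaging over pairs, applying the weight lambda to L_c, and the function names are assumptions made here rather than details confirmed by the paper.

```python
import math
from typing import List


def ranknet_loss(score_better: float, score_worse: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(score_better - score_worse))."""
    diff = score_better - score_worse
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def list_pairwise_loss(scores_best_first: List[float]) -> float:
    """Average RankNet loss over all pairs of lists sharing one prefix.

    scores_best_first[i] is the model score for the i-th list, with lists
    ordered from best to worst according to the supervision signal.
    """
    n = len(scores_best_first)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    total = sum(
        ranknet_loss(scores_best_first[i], scores_best_first[j]) for i, j in pairs
    )
    return total / len(pairs)


def retriever_loss(l_c: float, l_r: float, lam: float = 1.0) -> float:
    """Weighted sum of contrastive and ranking terms (weight placement assumed)."""
    return lam * l_c + l_r


print(round(ranknet_loss(0.0, 0.0), 4))             # 0.6931
print(list_pairwise_loss([2.0, 1.0, 0.0]) < list_pairwise_loss([0.0, 1.0, 2.0]))  # True
```

The loss falls when the model scores the better list above the worse one, so minimizing it pushes the reranker toward the ordering given by the supervision signal.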

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of the BERT-base encoders

Training Data:

MS MARCO training set used to construct demonstration pool.
Iterative approximation algorithm used to generate ranked lists of demonstration sequences for DReranker training.

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 32
epochs: 5
contrastive_loss_weight_lambda: 1.0
retrieved_candidates_b: 10
reranking_candidates_M: 100

Compute: Not reported in the paper

Comparison to Prior Work

vs. EPR/UDR: DemoRank adds a second reranking stage that explicitly models dependencies between demonstrations, whereas EPR/UDR select top-k independently.
vs. KATE/BM25: DemoRank uses task-specific supervision (LLM relevance scores) rather than just semantic or lexical similarity.
vs. standard ICL: Optimizes the *sequence* and *combination* of examples, not just individual relevance.

Limitations

Inference cost increases with the number of iterations in the reranking stage.
The greedy approximation used to generate training data may not find the globally optimal permutation.
Reliance on a specific LLM for generating supervision signals (though transferability is shown).

Reproducibility

Code availability is not reported. The MS MARCO, TREC-DL, and BEIR datasets are public. The method for constructing training data (Algorithm 1) is described in detail.

📊 Experiments & Results

Evaluation Setup

Passage ranking on standard benchmarks using LLaMA-2-7B and LLaMA-2-13B as the backbone LLMs for relevance generation.

Benchmarks:

MS MARCO (Passage Ranking)
TREC-DL 2019 (Passage Ranking)
TREC-DL 2020 (Passage Ranking)
BEIR (NFCorpus, COVID, ArguAna, Touche-2020) (Zero-shot Retrieval / Transfer Learning)

Metrics:

NDCG@10
MAP

Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on MS MARCO and TREC datasets showing DemoRank superiority over baselines using LLaMA-2-7B.
Benchmark	Metric	Baseline	This Paper	Δ
MS MARCO (Dev)	NDCG@10	74.72	75.33	+0.61
TREC-DL 2019	NDCG@10	71.85	74.00	+2.15
TREC-DL 2020	NDCG@10	69.19	70.93	+1.74
TREC-DL 2019	NDCG@10	73.28	74.65	+1.37

Experiment Figures

Performance (NDCG@10) on TREC-DL 2019/2020 with varying numbers of shots (k=1 to 5).

Main Takeaways

DemoRank consistently outperforms baseline retrievers (BM25, KATE, EPR, UDR) across in-domain (MS MARCO, TREC) and out-of-domain (BEIR) datasets.
The reranking module (DReranker) provides additive gains over the retrieval module (DRetriever), validating the importance of modeling demonstration dependencies.
The method shows strong transferability: a reranker trained with signals from a smaller model (LLaMA-2-7B) improves performance of a larger model (LLaMA-2-13B).
Performance improves as the number of demonstrations (shots) increases, up to a saturation point (around 3-5 shots).

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL)
Information Retrieval (IR) metrics (NDCG, MAP)
Bi-encoder and Cross-encoder architectures
Contrastive Loss
Listwise vs. Pairwise ranking

Key Terms

Relevance Generation: A pointwise ranking method where an LLM is prompted to output 'Yes' or 'No' for a query-passage pair, using the probability of 'Yes' as the score.
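
A minimal sketch of how such a score could be derived from the LLM's output logits, assuming the probability is normalized over only the 'Yes' and 'No' tokens (the summary does not say whether normalization is over these two tokens or the full vocabulary). The logit values are made up for illustration.

```python
import math
from typing import Dict


def relevance_score(logit_yes: float, logit_no: float) -> float:
    """Probability of 'Yes' under a softmax restricted to {'Yes', 'No'}."""
    shift = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - shift)
    exp_no = math.exp(logit_no - shift)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical (yes, no) logits for three passages of one query.
logits: Dict[str, tuple] = {"p1": (2.0, 0.5), "p2": (0.1, 1.3), "p3": (1.0, 1.0)}
scores = {pid: relevance_score(y, n) for pid, (y, n) in logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['p1', 'p3', 'p2']
```

Because each passage is scored on its own, the final ranking is obtained simply by sorting passages by this score.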

Demonstration Reranker (DReranker): A cross-encoder model that takes a query, passage, and a sequence of already selected demonstrations to predict the next best demonstration.

List-pairwise training: A training method where the model compares two demonstration lists that differ only in their last element to learn the optimal sequential selection.

NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that takes into account the position of relevant items.
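
For reference, NDCG@k can be computed as in the sketch below. It uses the linear-gain variant (gain equal to the relevance grade) and derives the ideal ordering from the same list of judged passages; evaluation toolkits differ on both choices, so treat these as assumptions.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the given ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of passages in ranked order (toy values).
print(ndcg_at_k([3, 2, 1, 0], k=10))        # 1.0
print(round(ndcg_at_k([0, 1], k=2), 4))     # 0.6309
```

The second example shows the position discount: placing the only relevant passage at rank 2 instead of rank 1 reduces the score to about 0.63.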

Bi-encoder: A model architecture that encodes two inputs (e.g., query and document) separately into vectors and computes their similarity (usually dot product).

Cross-encoder: A model architecture that concatenates two inputs and processes them together through the network layers, allowing for full interaction between them.

RankNet: A pairwise learning-to-rank loss function that optimizes the probability that a relevant document is ranked higher than an irrelevant one.

NP-hard: A class of problems that are at least as hard as the hardest problems in NP; here, finding the optimal permutation of demonstrations is computationally intractable.