Zero-Shot Learning Over Large Output Spaces: Utilizing Indirect Knowledge Extraction from Large Language Models

📝 Paper Summary

Extreme Multi-label Text Classification (XMC), Zero-shot Learning

LMTX trains a lightweight bi-encoder for extreme classification by using a large language model to select high-quality pseudo-positive labels from a retrieved shortlist, so no LLM calls are needed at inference time.

Core Problem

Existing Extreme Zero-shot XMC (EZ-XMC) methods rely on noisy pseudo-labels (like document titles or random segments) that misalign with target tasks, while direct LLM inference is too computationally expensive for large-scale tagging.

Why it matters:

Real-world systems (e.g., product tagging) face cold-start problems where new labels emerge dynamically without annotated data
Current lightweight methods use signals (like random spans) that lack semantic alignment with categorization tasks
Deploying massive LLMs for real-time inference on millions of labels is cost-prohibitive

Concrete Example: In previous methods like RTS, a document is split into two random segments, treating one as the 'label' for the other. These segments may be semantically unrelated if far apart, creating noisy training signals. LMTX instead asks an LLM: 'Is tag X relevant to document Y?', filtering out bad matches.

Key Novelty

Large Language Model as Teacher for eXtreme classification (LMTX)

Uses a curriculum-based iterative loop: a bi-encoder retrieves candidate labels, and an LLM acts as a 'teacher' to verify which candidates are actually relevant
Filters noisy candidates into high-quality pseudo-positives using the LLM's zero-shot reasoning capabilities ('Yes/No' relevance check)
Distills LLM knowledge into a lightweight bi-encoder, allowing the expensive teacher to be discarded during final inference
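
The 'Yes/No' relevance check can be illustrated with a minimal sketch. The prompt wording, the function names, and the `toy_teacher` stand-in are all assumptions made for illustration; the paper's actual templates are in its appendix, and a real setup would call an LLM where the toy function is used.

```python
from typing import Callable, List

# Hypothetical prompt wording, not the paper's template.
PROMPT_TEMPLATE = (
    "Document: {document}\n"
    "Tag: {label}\n"
    "Is the tag relevant to the document? Answer Yes or No."
)


def filter_shortlist(
    document: str,
    shortlist: List[str],
    ask_llm: Callable[[str], str],
) -> List[str]:
    """Keep only the shortlisted labels for which the teacher answers 'Yes'."""
    kept = []
    for label in shortlist:
        prompt = PROMPT_TEMPLATE.format(document=document, label=label)
        answer = ask_llm(prompt).strip().lower()
        if answer.startswith("yes"):
            kept.append(label)
    return kept


def toy_teacher(prompt: str) -> str:
    """Stand-in for an LLM call: answers Yes if the tag occurs in the document.

    Assumes a single-line document, which is enough for this toy example.
    """
    document_line, tag_line = prompt.split("\n")[:2]
    document = document_line[len("Document: "):].lower()
    tag = tag_line[len("Tag: "):].lower()
    return "Yes" if tag in document else "No"


pseudo_positives = filter_shortlist(
    "Wireless noise-cancelling headphones with USB-C charging",
    ["headphones", "kitchen", "usb-c"],
    toy_teacher,
)
```

Passing the teacher in as a callable keeps the filtering logic independent of which LLM is used, which fits the summary's observation that different teachers work best on different datasets.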

Architecture

The iterative training framework of LMTX involves three stages: Shortlist Generation, LLM Teacher Filtering, and Bi-encoder Training.
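
The three stages can be sketched as one loop, under the assumption that each stage is a pluggable function. All names here (`training_round`, `run_rounds`, and the toy stand-ins) are hypothetical; the real system would use the bi-encoder for retrieval, an LLM for filtering, and a gradient update for training.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (document, pseudo-positive label)


def training_round(
    documents: List[str],
    retrieve: Callable[[str], List[str]],
    teacher_accepts: Callable[[str, str], bool],
    train_step: Callable[[List[Pair]], None],
) -> List[Pair]:
    """Run one round: shortlist, filter with the teacher, then train."""
    pairs: List[Pair] = []
    for doc in documents:
        shortlist = retrieve(doc)            # stage 1: shortlist generation
        for label in shortlist:
            if teacher_accepts(doc, label):  # stage 2: LLM teacher filtering
                pairs.append((doc, label))
    train_step(pairs)                        # stage 3: bi-encoder training
    return pairs


def run_rounds(
    documents: List[str],
    retrieve: Callable[[str], List[str]],
    teacher_accepts: Callable[[str, str], bool],
    train_step: Callable[[List[Pair]], None],
    num_rounds: int,
) -> List[List[Pair]]:
    """Repeat the round; in a real system retrieval improves as the student trains."""
    return [
        training_round(documents, retrieve, teacher_accepts, train_step)
        for _ in range(num_rounds)
    ]


# Toy stand-ins so the sketch runs end to end.
LABELS = ["laptop", "charger", "garden"]
pairs_seen_per_round: List[int] = []


def toy_retrieve(doc: str) -> List[str]:
    return LABELS[:2]  # pretend these are the top-2 retrieved labels


def toy_teacher(doc: str, label: str) -> bool:
    return label in doc  # pretend substring match is the LLM's judgement


def toy_train_step(pairs: List[Pair]) -> None:
    pairs_seen_per_round.append(len(pairs))  # a real step would update weights


history = run_rounds(
    ["laptop with charger", "garden hose"],
    toy_retrieve, toy_teacher, toy_train_step, num_rounds=2,
)
```

In this toy version the retriever never changes, so every round yields the same pairs; the point of the real loop is that a better-trained student produces better shortlists for the teacher to filter.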

Evaluation Highlights

+31% relative improvement in Precision@1 on LF-Wikipedia-500K (31.25 to 41.05) over state-of-the-art non-LLM baselines
+37% relative improvement in Precision@1 on AmazonCat-13K (63.95 to 87.89) over state-of-the-art non-LLM baselines
Significantly outperforms direct LLM inference (ICXML) on EURLex-4k (P@1 47.28 vs 19.14) while being orders of magnitude faster

Breakthrough Assessment

8/10

Establishes a new state-of-the-art in zero-shot extreme classification by effectively bridging the gap between high-quality LLM reasoning and the efficiency of bi-encoders, addressing a critical scalability bottleneck.

⚙️ Technical Details

Problem Definition

Setting: Extreme Zero-Shot Multi-label Text Classification (EZ-XMC)

Inputs: Raw document text X_i and a predefined label set {l_k} (no annotated document-label pairs provided)

Outputs: A subset of relevant labels {l_j} for the document

Pipeline Flow

Bi-encoder (encodes document)
MIPS Index (retrieves top-m labels)

System Modules

Bi-encoder (Inference)

Generate embeddings for the input document

Model or implementation: DistilBERT-based transformer (fine-tuned via LMTX process)

MIPS Index (Inference)

Retrieve the most similar labels from the pre-computed label embedding space

Model or implementation: Faiss index
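
The two inference modules can be sketched as follows. This is a minimal illustration that swaps the Faiss index for an exhaustive inner-product scan in plain Python and uses hard-coded vectors in place of DistilBERT embeddings; the function names and the example vectors are assumptions.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_m_labels(
    doc_embedding: Sequence[float],
    label_embeddings: List[Sequence[float]],
    label_names: List[str],
    m: int,
) -> List[Tuple[str, float]]:
    """Return the m labels with the largest inner product to the document.

    Exhaustive search stands in for the approximate index used at scale.
    """
    scored = [
        (inner_product(doc_embedding, emb), name)
        for emb, name in zip(label_embeddings, label_names)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:m]]


# Hypothetical pre-computed label embeddings and one document embedding.
label_names = ["electronics", "gardening", "accessories"]
label_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
doc_embedding = [1.0, 0.0]

predictions = top_m_labels(doc_embedding, label_embeddings, label_names, m=2)
```

Because label embeddings do not depend on the document, they can be computed once and indexed; only the document needs to be encoded per query, which is what keeps inference lightweight.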

Novel Architectural Elements

Iterative Teacher-Student training loop where the teacher (LLM) dynamically filters the student's (bi-encoder) retrieval shortlist to create cleaner training data

Modeling

Base Model: DistilBERT (for bi-encoder)

Training Method: Iterative pseudo-label refinement and triplet loss minimization

Objective Functions:

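Purpose: Score each document higher against its pseudo-positive labels than against negatives by a margin of at least gamma, where the score is the inner product of the bi-encoder embeddings.

Formally: Triplet loss L = max(0, gamma + <E(X), E(l_n)> - <E(X), E(l_p)>), where E is the bi-encoder, X is the document, l_p is a pseudo-positive label, and l_n is a negative label.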

Trainable Parameters: All parameters of the bi-encoder (DistilBERT)

Key Hyperparameters:

margin (gamma): Not explicitly reported in the paper
LLM models used: WizardLM-13B-V1.2, Llama-2-13b-chat-hf
Shortlist size (j): Not explicitly reported in the paper

Compute: Training time varies (e.g., ~16.5h for LF-Wikipedia-500K on 4 NVIDIA A100 GPUs)

Comparison to Prior Work

vs. MACLR/RTS: LMTX uses semantic verification via LLM instead of structural heuristics (titles/spans), resulting in higher quality pairs
vs. ICXML: LMTX uses LLM only during training to select labels, keeping inference lightweight (bi-encoder only), whereas ICXML requires expensive LLM calls at inference

Limitations

Depends on the quality of the LLM teacher; if the LLM hallucinates relevance, noise is introduced
Training time can be significant due to the iterative nature and LLM querying cost (though inference is fast)
Hard negative sampling using LLM 'No' responses degrades performance, contrary to expectations

Reproducibility

Code: https://github.com/xmc-aalto/LMTX

The code is publicly available at the repository linked above. Datasets are standard XMC benchmarks (EURLex-4k, Wiki10-31k, AmazonCat-13K, etc.). Specific prompt templates are discussed in Appendix A.6. Hyperparameters such as the margin gamma are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot tagging where models predict labels for documents without seeing annotated training pairs

Benchmarks:

EURLex-4k (Legal document tagging)
Wiki10-31k (Wikipedia article tagging)
AmazonCat-13K (Product categorization)
LF-WikiSeeAlso-320K (Wikipedia related article recommendation)
LF-Wikipedia-500K (Large-scale Wikipedia article tagging)

Metrics:

Precision@k (P@1, P@3, P@5)
Recall@m (R@10, R@50, R@100)
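
For reference, the two metrics can be computed per document as in the sketch below, using the usual definitions (hits in the top k divided by k, and hits in the top m divided by the number of relevant labels). The function names and the example labels are illustrative; benchmark scores average these values over all test documents.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k predicted labels that are relevant."""
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / k


def recall_at_m(ranked: List[str], relevant: Set[str], m: int) -> float:
    """Fraction of the relevant labels that appear in the top-m predictions."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:m]) & relevant)
    return hits / len(relevant)


ranked = ["a", "b", "c", "d"]  # model's predictions, best first
relevant = {"a", "c", "e"}     # ground-truth labels

p_at_1 = precision_at_k(ranked, relevant, k=1)  # 1 hit out of 1
p_at_3 = precision_at_k(ranked, relevant, k=3)  # 2 hits out of 3
r_at_3 = recall_at_m(ranked, relevant, m=3)     # 2 of 3 relevant labels found
```

Precision@k measures the quality of the final top predictions, while Recall@m measures how many relevant labels make it into a longer shortlist.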
Statistical methodology: Not explicitly reported in the paper

Key Results

LMTX demonstrates superior performance over non-LLM baselines across multiple large-scale datasets, particularly in Precision@1.
Benchmark	Metric	Baseline	This Paper	Δ
LF-Wikipedia-500K	P@1	31.25	41.05	+9.80
AmazonCat-13K	P@1	63.95	87.89	+23.94
EURLex-4k	P@1	42.92	47.28	+4.36
LF-WikiSeeAlso-320K	P@1	26.36	26.56	+0.20
Comparison against LLM-based inference (ICXML) shows LMTX achieves higher precision with much lower inference cost.
Benchmark	Metric	Baseline (ICXML)	This Paper	Δ
EURLex-4k	P@1	19.14	47.28	+28.14
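
The Δ column above is an absolute difference in P@1 points, whereas the percentages quoted under Evaluation Highlights match relative gains over the baseline. The short check below recomputes both from the table values; the function names are illustrative.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in metric points."""
    return new - baseline


def relative_gain_percent(baseline: float, new: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    return 100.0 * (new - baseline) / baseline


# P@1 values (baseline, LMTX) taken from the results table.
results = {
    "LF-Wikipedia-500K": (31.25, 41.05),
    "AmazonCat-13K": (63.95, 87.89),
}

gains = {
    name: (absolute_gain(base, new), relative_gain_percent(base, new))
    for name, (base, new) in results.items()
}
```

For LF-Wikipedia-500K this gives roughly 9.80 points absolute and about 31% relative; for AmazonCat-13K, roughly 23.94 points and about 37%.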

Experiment Figures

Impact of reducing training sample size on P@1 and training time for AmazonCat-13K.

Comparison of negative sampling strategies: In-batch negatives vs. LLM-derived Hard Negatives.

Main Takeaways

Teacher choice matters: Different LLMs (WizardLM vs Llama2) perform best on different datasets, highlighting LMTX's flexibility.
Sampling efficiency: LMTX can achieve competitive performance using only a subset of training documents, balancing training cost and accuracy.
Initialization robustness: LMTX outperforms baselines even when both start from the same initialization, which indicates that the gain comes from the learning process and pseudo-labels rather than from the starting weights.
Negative sampling nuance: Using LLM-rejected labels as 'hard negatives' actually hurt performance compared to using in-batch negatives.
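
To make the in-batch option concrete, here is a minimal sketch of how in-batch negatives are commonly formed: each document's negatives are the pseudo-positive labels of the other documents in the same batch. This is a generic illustration, not the paper's batching code; the function name and the rule of skipping identical labels are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]          # (document, pseudo-positive label)
Triplet = Tuple[str, str, str]  # (document, positive label, negative label)


def in_batch_triplets(batch: List[Pair]) -> List[Triplet]:
    """Pair each document with the positives of other documents as negatives.

    A label identical to the document's own positive is skipped so that it is
    never used as its own negative.
    """
    triplets: List[Triplet] = []
    for i, (doc, positive) in enumerate(batch):
        for j, (_, other_label) in enumerate(batch):
            if i != j and other_label != positive:
                triplets.append((doc, positive, other_label))
    return triplets


batch = [("d1", "laptop"), ("d2", "garden"), ("d3", "laptop")]
triplets = in_batch_triplets(batch)
```

In-batch negatives cost no extra teacher queries. LLM-rejected labels, by contrast, come from the retrieval shortlist and so tend to lie close to the document in embedding space, which may be one reason treating them as negatives is riskier.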

📚 Prerequisite Knowledge

Prerequisites

Understanding of Bi-encoder architectures (BERT-based)
Concept of Pseudo-labeling in semi-supervised/unsupervised learning
Knowledge of Approximate Nearest Neighbor Search (ANNS)

Key Terms

XMC: Extreme Multi-label Text Classification—assigning labels from a set of hundreds of thousands or millions

EZ-XMC: Extreme Zero-shot XMC—XMC where no annotated training data is available (only raw text and label names)

Bi-encoder: A model architecture that encodes documents and labels separately into the same vector space to compute similarity

ANNS: Approximate Nearest Neighbor Search—algorithms to efficiently find similar vectors in large datasets without exhaustive comparison

MIPS: Maximum Inner Product Search, the task of finding the stored vectors with the largest dot product to a query vector; it is equivalent to cosine-similarity search when all vectors are normalized to unit length

Pseudo-labels: Labels generated automatically (e.g., by a model or heuristic) rather than by human annotators, used for training

Hard negatives: Incorrect labels that are very similar to the correct label or document, used to force the model to learn finer distinctions