Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation

📝 Paper Summary

LLM-based Recommendation (LLM4Rec)
Efficient Inference

L2D accelerates LLM-based recommendation by replacing slow text generation with fast vector matching in the model's internal latent space, preserving performance while reducing latency.

Core Problem

Fine-tuned LLMs for recommendation suffer from high latency due to autoregressive decoding, where generating a list of item titles requires sequentially predicting tokens one by one.

Why it matters:

Each recommendation request typically requires generating a list of items, causing costs to scale linearly with list size
Autoregressive generation cannot emit a token until all preceding tokens have been produced, so decoding cannot be parallelized across the output and real-time deployment of LLM recommenders becomes computationally prohibitive
Existing grounding techniques that map one generated item to multiple real items can cause up to a 50% performance drop

Concrete Example: Generating a list of 10 items for a user requires the LLM to sequentially output hundreds of tokens (titles). L2D avoids this by outputting a single vector representation that is instantly matched against candidate items.

Key Novelty

Light Latent-space Decoding (L2D)

Treats the LLM's final hidden state as a 'thought' representation of the user's preferred item, bypassing the need to translate this thought into text tokens
Pre-computes vector representations for all candidate items by aggregating the hidden states of training examples where those items were the ground truth
Decodes recommendations by simply finding the nearest pre-computed item vectors to the test user's current hidden state vector

Architecture

The L2D framework proceeds in three stages: Memory Construction, Representation Generation, and Item Decoding.

Evaluation Highlights

Reduces inference latency by >10x compared to standard language-space decoding (beam search) while maintaining comparable accuracy
Outperforms AlphaRec, an efficient LLM-based embedding baseline, with at least 5x lower inference cost and better Recall and NDCG
Achieves higher Recall@20 than standard decoding (beam=1) on Amazon Games (+11.2% relative improvement for ID-based classifier variant)

Breakthrough Assessment

7/10

Significant efficiency gain (>10x) for LLM recommenders with a simple, effective method. Successfully bridges generative training with discriminative inference.

⚙️ Technical Details

Problem Definition

Setting: Next-item prediction given textual user interaction history

Inputs: User interaction history sequence s_j (text format)

Outputs: Ranked list of Top-K items

Pipeline Flow

LLM Encoder (computes test sample hidden state)
Memory Aggregator (constructs candidate item vectors from training states)
Similarity Matcher (computes L2 distance between test state and candidate vectors)

System Modules

LLM Encoder

Encodes the user history prompt into a final hidden state

Model or implementation: Llama 3.2-1B (fine-tuned)

Memory Aggregator

Retrieves pre-stored hidden states of training samples associated with candidate items and aggregates them

Model or implementation: Non-parametric aggregation (Global or Local)
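
The two aggregation modes can be illustrated with a minimal sketch. It rests on assumptions the summary does not confirm: that aggregation is a plain element-wise mean, and that Local aggregation first keeps only the M training states nearest to the test state and then averages per item. The names `global_aggregate`, `local_aggregate`, and `memory` are hypothetical.

```python
import math
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def global_aggregate(memory):
    """Average every stored state per ground-truth item.

    memory: list of (hidden_state, ground_truth_item) pairs.
    Assumption: the aggregation is a simple mean.
    """
    by_item = defaultdict(list)
    for state, item in memory:
        by_item[item].append(state)
    return {item: mean_vector(states) for item, states in by_item.items()}

def local_aggregate(memory, test_state, m):
    """Average per item, using only the m stored states nearest to test_state.

    Assumption: "Local" means filtering the memory by proximity to the test state.
    Items with no state among the m neighbors get no vector at all.
    """
    nearest = sorted(memory, key=lambda pair: l2(pair[0], test_state))[:m]
    return global_aggregate(nearest)

# Toy memory with 2-dimensional "hidden states".
memory = [
    ([0.0, 0.0], "game_a"),
    ([2.0, 0.0], "game_a"),
    ([10.0, 10.0], "game_b"),
]
global_vectors = global_aggregate(memory)
local_vectors = local_aggregate(memory, test_state=[0.1, 0.0], m=1)
```

In this toy case the local variant returns a vector only for `game_a`, which hints at why a neighbor-filtered scheme could struggle with items that have few training examples.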

Similarity Matcher

Scores candidate items based on distance to the test hidden state

Model or implementation: L2 Distance
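
As a rough illustration of the matching step, the sketch below ranks candidate items by L2 distance to a test hidden state and returns the K closest. It is not the authors' implementation: `decode_top_k` and `item_vectors` are hypothetical names, and a real system would likely use batched tensor operations or an approximate nearest-neighbor index rather than a Python loop.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decode_top_k(test_state, item_vectors, k):
    """Return the k item ids whose vectors lie closest to test_state.

    item_vectors: dict mapping item id -> pre-computed item vector.
    Smaller distance means a better match, so we take the k smallest.
    """
    scored = (
        (l2_distance(test_state, vector), item)
        for item, vector in item_vectors.items()
    )
    return [item for _, item in heapq.nsmallest(k, scored)]

# Toy candidates with 2-dimensional vectors.
item_vectors = {
    "cd_1": [1.0, 0.0],
    "cd_2": [0.0, 3.0],
    "cd_3": [5.0, 5.0],
}
top2 = decode_top_k([0.9, 0.1], item_vectors, k=2)
```

No tokens are generated here: one forward pass yields `test_state`, and the ranked list comes from distance comparisons alone.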

Novel Architectural Elements

Bypassing the language modeling head entirely during inference to perform decoding via vector matching in the latent space
Dual aggregation strategies (Global vs. Local) to construct item representations from training sample hidden states

Modeling

Base Model: Llama 3.2-1B

Training Method: Supervised Fine-Tuning (SFT) on recommendation data

Objective Functions:

Purpose: Standard next-token prediction for generative recommendation.

Formally: Autoregressive language modeling loss.
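
For reference, the autoregressive language modeling loss is commonly written as below. The notation is an assumption rather than the paper's own: x is the instruction prompt, y the token sequence of the ground-truth item title, D the training set, and θ the model parameters.

```latex
\mathcal{L}(\theta) = - \sum_{(x,\, y) \in \mathcal{D}} \; \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```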

Adaptation: Full fine-tuning (implied, as LoRA not explicitly mentioned for main results)

Training Data:

Amazon CDs and Amazon Games datasets
Review sequences truncated to length 10
Converted to instruction format: 'A user has interacted with... which item would the user like next?'

Key Hyperparameters:

M (Local Aggregation neighbors): Varies (e.g., 100)
Sequence Length: Max 10 interactions

Compute: Inference is >10x faster than beam search (beam=10) and ~5x faster than AlphaRec. Storage for hidden states is ~2TB for 10^9 samples (manageable with sampling).
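
The storage figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a hidden size of 2048 for Llama 3.2-1B and treats bytes per stored value as a free parameter. Neither value is stated in this summary. Under those assumptions, ~2TB corresponds to one byte per value, while 16-bit storage would need roughly twice that.

```python
def hidden_state_storage_bytes(num_samples, hidden_dim, bytes_per_value):
    """Raw bytes needed to store one hidden-state vector per training sample."""
    return num_samples * hidden_dim * bytes_per_value

TB = 10 ** 12  # decimal terabyte

NUM_SAMPLES = 10 ** 9   # sample count quoted in the summary
HIDDEN_DIM = 2048       # assumption: hidden size of Llama 3.2-1B

one_byte_per_value = hidden_state_storage_bytes(NUM_SAMPLES, HIDDEN_DIM, 1)
two_bytes_per_value = hidden_state_storage_bytes(NUM_SAMPLES, HIDDEN_DIM, 2)

print(one_byte_per_value / TB, "TB at 1 byte per value")
print(two_bytes_per_value / TB, "TB at 2 bytes per value")
```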

Comparison to Prior Work

vs. AlphaRec: L2D uses a single generative model's internal states rather than training separate encoders
vs. BIGRec/D3: L2D replaces the slow token-by-token generation with vector matching
vs. TALLRec: L2D focuses on efficient inference via latent decoding rather than just tuning effectiveness [not cited in paper]

Limitations

Memory overhead: Requires storing hidden states for training samples (mitigated by sampling)
Sparse item handling: Local aggregation strategy struggles with items that have few training examples
Model scale: Evaluated only on relatively small Llama 3.2-1B; scalability to larger models not explicitly tested

Reproducibility

Code URL not provided. Datasets (Amazon CDs, Amazon Games) are public. Method relies on storing hidden states from training data, which requires substantial memory but can be mitigated by reservoir sampling (30% retention shown to be effective).
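
Reservoir sampling itself is a standard technique. The sketch below shows the classic single-pass variant (Algorithm R), which keeps a uniform random sample of k elements from a stream. How the paper applies it to stored hidden states is not detailed in this summary, so the stream of integers and the choice of k (30% of 1000) are purely illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from an iterable (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (index + 1).
            slot = rng.randint(0, index)  # both ends inclusive
            if slot < k:
                reservoir[slot] = item
    return reservoir

# Illustrative: retain 300 of 1000 stand-in "hidden states" (30% retention).
kept = reservoir_sample(range(1000), k=300, rng=random.Random(0))
```

Because the reservoir never grows beyond k entries, memory use is bounded even when the number of training samples is not known in advance.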

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on Amazon datasets

Benchmarks:

Amazon CDs (Sequential Recommendation)
Amazon Games (Sequential Recommendation)

Metrics:

Recall@K (R@K)
NDCG@K (N@K)
Inference Latency (ms/user)
Statistical methodology: Not explicitly reported in the paper
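
For next-item prediction with a single ground-truth item per user, Recall@K and NDCG@K reduce to simple per-user formulas. The sketch below assumes that single-target setting; the paper's exact evaluation protocol is not reproduced here, and the function names are illustrative. Scores would be averaged over users.

```python
import math

def recall_at_k(ranked_items, target, k):
    """1.0 if the target item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K with one relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG equals
    1 / log2(rank + 1) when the target is in the top k, else 0.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)

ranked = ["game_c", "game_a", "game_b"]
recall = recall_at_k(ranked, "game_a", k=2)
ndcg = ndcg_at_k(ranked, "game_a", k=2)
```

Recall@K only records whether the target was retrieved, while NDCG@K also rewards placing it nearer the top of the list.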

Key Results

Main performance comparison showing L2D variants outperforming generative baselines (with beam search=1) and achieving comparable results to expensive beam search.
Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Amazon Games	Recall@20	0.0651	0.0760	+0.0109
Amazon CDs	Recall@20	0.0881	0.1009	+0.0128
Amazon CDs	NDCG@20	0.0535	0.0593	+0.0058

Ablation study on aggregation strategies shows different strengths for sparse vs. dense items.
Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Amazon Games	Recall@20 (Dense Items)	0.060	0.085	+0.025

Experiment Figures

Scatter plot of Recommendation Performance (Recall@20) vs. Inference Time Cost.

Performance comparison of Global vs. Local aggregation on Sparse vs. Dense item groups.

Main Takeaways

L2D achieves >10x speedup over autoregressive decoding while maintaining competitive recommendation accuracy.
L2D-L (Local Aggregation) performs better on dense datasets/items by filtering irrelevant training samples.
L2D-G (Global Aggregation) is more robust for sparse items where training data is limited.
Using reservoir sampling to store only 30% of training hidden states still yields performance competitive with SASRec.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation with LLMs
Autoregressive Decoding vs. Non-autoregressive
Vector Similarity Search

Key Terms

L2D: Light Latent-space Decoding—the proposed method that decodes items by matching hidden states rather than generating text

Autoregressive decoding: Generating text one token at a time, where each new token depends on all previously generated tokens

Language-space decoding: The standard LLM process of generating output as natural language text (e.g., item titles)

Latent space: The internal high-dimensional vector space of the LLM where input text is represented as numerical embeddings

Hidden state: The vector representation of the input at the final layer of the LLM, before the language modeling head that maps it to token probabilities

Reservoir sampling: A single-pass randomized algorithm that draws a uniform random sample of k items from a stream of n items, where n is either very large or unknown in advance

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

Beam search: A search algorithm that explores multiple promising paths (beams) simultaneously to find the most likely sequence of tokens

SFT: Supervised Fine-Tuning—training the LLM on labeled (prompt, ground-truth item) pairs