Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

📝 Paper Summary

Model Compression, Efficient Inference, Recommendation Systems

This paper presents a three-stage pipeline (distillation, structured pruning, and re-distillation) to compress 100B+ parameter recommendation models into efficient small language models while maintaining ranking accuracy.

Core Problem

Massive language models (100B+ parameters) offer superior performance for recommendation tasks but incur prohibitively high latency and infrastructure costs for real-time serving.

Why it matters:

Real-time recommendation systems require extremely low latency (milliseconds) to rank hundreds of items per user request
Deploying 100B+ models at scale is economically impractical due to hardware requirements
Standard compression techniques often degrade model utility (accuracy/AUC) significantly on sensitive ranking tasks

Concrete Example: A social network needs to rank a feed of items for a user. Using a 100B+ MoE model for every item incurs massive prefill latency. A compressed 3B model is needed that mimics the 100B model's ranking decisions without the computational cost.

Key Novelty

Distill-Prune-Redistill Pipeline for RecSys

Combines Knowledge Distillation (transferring knowledge from a 100B+ teacher) with One-Shot Structured Pruning (removing MLP neurons and attention heads) in a specific sequence
Crucially employs a 're-distillation' phase after pruning to recover lost generalization capabilities, using the unpruned student as the new teacher
Validates the pipeline on industrial-scale recommendation workloads (ranking and reasoning) rather than just generic NLP benchmarks

Architecture

The three-stage model compression pipeline: (1) Distillation of FM to Student, (2) One-shot Structured Pruning, (3) Re-distillation/Fine-tuning of the Pruned Model.

Evaluation Highlights

Distilled Llama-3.1-8B model stays within 0.06% AUC of the 100B+ foundation model (-0.06% delta), far better than standard fine-tuning of the same model (-0.62% delta)
Structured pruning of attention heads yields a >28% speedup in prefill latency
In a live A/B test for a reasoning task, the distilled model improved an internal quality metric (IQM) by 20.29% compared to the previous baseline

Breakthrough Assessment

7/10

Strong practical contribution demonstrating a complete recipe for deploying LLMs in large-scale RecSys. While the individual techniques (KD, Pruning) are known, the specific integration and industrial validation on 100B+ models are significant.

⚙️ Technical Details

Problem Definition

Setting: Deploying efficient Small Language Models (SLMs) for ranking and reasoning in Recommendation Systems

Inputs: User history and item text features (concatenated into a prompt)

Outputs: Probability of interaction (for ranking) or reasoning text (for generative tasks)

Pipeline Flow

Foundation Model (Teacher) → Knowledge Distillation → Student Model
Student Model → Structured Pruning (OSSCAR) → Pruned Model
Pruned Model → Re-distillation (Student as Teacher) → Final Efficient SLM
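
The flow above can be sketched as a small orchestration function. This is only an illustration of the ordering and of the teacher swap in the last stage: `compress_pipeline`, `distill`, and `prune` are hypothetical names, and the callables stand in for training and pruning jobs that the summary does not describe in detail.

```python
def compress_pipeline(foundation_model, student_init, distill, prune):
    """Run distill -> prune -> re-distill and return the final model.

    `distill(teacher, student)` and `prune(model)` are placeholders
    (assumptions) for the real training and pruning procedures.
    """
    # Stage 1: the large foundation model teaches the dense student.
    student = distill(teacher=foundation_model, student=student_init)
    # Stage 2: structured pruning shrinks the distilled student.
    pruned = prune(student)
    # Stage 3: the unpruned student from stage 1 becomes the teacher
    # that helps the pruned model recover accuracy.
    return distill(teacher=student, student=pruned)
```

The point of the sketch is that the second distillation call does not go back to the 100B+ model; it reuses the stage-1 student as teacher, as the pipeline flow states.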

System Modules

Distillation Stage (Compression)

Transfer knowledge from FM to a dense SLM

Model or implementation: Llama-3.1-8B-Instruct or Llama-3.2-3B-Instruct

Pruning Stage (Compression)

Physically reduce model size by removing MLP neurons and attention heads

Model or implementation: Distilled Student Model
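
To make "physically removing MLP neurons" concrete, here is a toy sketch in plain Python. It uses a simple weight-magnitude heuristic to pick which hidden neurons to drop; this is a stand-in chosen for brevity and is not the OSSCAR algorithm used in the paper. The function name, the weight layout, and the scoring rule are all assumptions.

```python
import math


def prune_mlp_neurons(w_in, w_out, keep_ratio):
    """Drop the lowest-scoring hidden neurons from a two-layer MLP.

    w_in:  one row per hidden neuron (hidden x d_model)
    w_out: one row per output dimension (d_model x hidden)
    Scoring is a magnitude heuristic (assumption), not OSSCAR.
    """
    hidden = len(w_in)
    n_keep = max(1, int(round(hidden * keep_ratio)))

    def score(j):
        # Importance proxy: norm of incoming weights times norm of outgoing weights.
        in_norm = math.sqrt(sum(x * x for x in w_in[j]))
        out_norm = math.sqrt(sum(row[j] ** 2 for row in w_out))
        return in_norm * out_norm

    ranked = sorted(range(hidden), key=score, reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original neuron order
    # Removing a neuron deletes its row in w_in and its column in w_out,
    # so both matrices get smaller and the layer becomes cheaper to run.
    pruned_in = [w_in[j] for j in kept]
    pruned_out = [[row[j] for j in kept] for row in w_out]
    return pruned_in, pruned_out, kept
```

Because whole rows and columns are removed, the result is a smaller dense layer rather than a sparse one, which is why structured pruning translates directly into latency gains on standard hardware.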

Serving Engine

Serve the compressed model with low latency

Model or implementation: SGLang with FP8 Quantization

Modeling

Base Model: Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct (Students); Internal MoE ~100B (Teacher)

Training Method: Knowledge Distillation and Structured Pruning

Objective Functions:

Purpose: Transfer teacher probability distribution to student.

Formally: Loss = 0.9 * ForwardKL(Teacher || Student) + 0.1 * CrossEntropy(GroundTruth, Student)

Purpose: Maintain general knowledge during distillation.

Formally: 5% of loss computed over prompt tokens (not just completion tokens)

Purpose: Reasoning task optimization.

Formally: Two-stage training: (1) SFT on teacher traces, (2) On-policy Forward KL
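
The first objective above (0.9 forward KL plus 0.1 cross-entropy) can be written out as a minimal, dependency-free sketch. It assumes per-token logits for teacher and student over the same vocabulary and averages over token positions; the function names and the averaging choice are assumptions, and the prompt-token weighting from the second objective is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def cross_entropy(student_probs, target_index, eps=1e-12):
    """Negative log-likelihood of the ground-truth token under the student."""
    return -math.log(student_probs[target_index] + eps)


def distillation_loss(teacher_logits, student_logits, targets,
                      kl_weight=0.9, ce_weight=0.1):
    """Mean over positions of kl_weight * KL + ce_weight * CE."""
    total = 0.0
    for t_logits, s_logits, target in zip(teacher_logits, student_logits, targets):
        p = softmax(t_logits)
        q = softmax(s_logits)
        total += kl_weight * forward_kl(p, q) + ce_weight * cross_entropy(q, target)
    return total / len(targets)
```

When the student already matches the teacher exactly, the KL term vanishes and only the 0.1-weighted cross-entropy against the labels remains, which shows how small the hard-label contribution is in this recipe.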

Adaptation: Full parameter update during distillation; Structured pruning of MLPs and Attention Heads

Trainable Parameters: All parameters of the student model

Key Hyperparameters:

kl_weight: 0.9
ce_weight: 0.1
on_policy_sampling_fraction: 1.0 (or 0.5 for speed)
temperature: 0.8-0.9 (for reasoning generation)
context_length: up to 32k

Compute: Serving benchmarks on NVIDIA H100 (8 GPUs) and A100. Training hardware not explicitly detailed.

Comparison to Prior Work

vs. SFT: Uses logits from a 100B+ teacher (KD) to preserve generalization better than hard labels
vs. One-shot Pruning: Adds a re-distillation phase to recover accuracy lost during structural removal of heads/neurons
vs. Standard KD [not cited in paper]: Specifically targets RecSys distributions and integrates structured pruning into the loop

Limitations

Relies on a high-quality internal Foundation Model as a teacher; results depend on teacher quality
Pruning quality degrades at higher pruning ratios (e.g., accuracy starts to drop beyond a 20% size reduction unless KD is applied)
Experiments focus on specific industrial RecSys tasks; generalization to other domains is not tested

Reproducibility

Not provided. The Foundation Model is internal/proprietary (LinkedIn). The student models (Llama/Qwen) are public, but the specific training data (RecSys logs) and the trained weights are not released.

📊 Experiments & Results

Evaluation Setup

Industrial Recommendation System ranking and reasoning tasks

Benchmarks:

Internal RecSys Ranking Tasks (click/like prediction, binary classification) [New]
Internal Reasoning Tasks (Text Generation) [New]
AIME 2024 (Mathematical Reasoning)

Metrics:

AUC (Area Under Curve)
Validation Loss
Internal Quality Metric (IQM)
TTFT (Time To First Token)
Throughput
Statistical methodology: Not explicitly reported in the paper
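
For readers less familiar with the ranking metric, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below is a plain-Python illustration with hypothetical function names. The summary does not say whether "AUC Delta (%)" is a relative or an absolute difference; the helper assumes relative.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison; ties between a positive and a negative count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_auc_delta_pct(auc_model, auc_reference):
    """Relative AUC change in percent (assumed definition, not confirmed by the paper)."""
    return 100.0 * (auc_model - auc_reference) / auc_reference
```

The pairwise loop is quadratic and only meant for small examples; production evaluation would use a sort-based implementation.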

Key Results

Predictive task results comparing Knowledge Distillation (KD) against Supervised Fine-Tuning (SFT) relative to the Foundation Model (FM) baseline. KD preserves performance significantly better.

Benchmark	Metric	Baseline	This Paper	Δ
Internal Ranking Tasks (Llama-3.1-8B, KD)	AUC Delta (%)	0.00	-0.06	-0.06
Internal Ranking Tasks (Llama-3.1-8B, SFT)	AUC Delta (%)	0.00	-0.62	-0.62
Internal Ranking Tasks (Llama-3.2-3B, KD)	AUC Delta (%)	0.00	-0.15	-0.15
Internal Ranking Tasks (Llama-3.2-3B, SFT)	AUC Delta (%)	0.00	-1.21	-1.21

Reasoning task improvements using the proposed distillation recipes on open source models.

Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	Performance Improvement	0	20	+20
Serving Latency	Prefill Speedup (%)	0	28	+28

Main Takeaways

Knowledge Distillation (KD) is vastly superior to Supervised Fine-Tuning (SFT) for compressing RecSys models, shrinking the AUC drop from 1.21% (SFT) to 0.15% (KD) for 3B models.
Re-distillation after structured pruning is critical; it recovers almost all performance lost during the pruning step, enabling dense-to-sparse compression without utility loss.
Gradual pruning (multistep) outperforms one-shot pruning, allowing for near-lossless compression from 3B to 2.4B parameters.
Structured pruning of attention heads provides a 40% improvement in attention latency, translating to a >28% end-to-end prefill speedup.
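
The headline numbers can be put in perspective with a little arithmetic. The sketch below computes how much of the SFT accuracy drop KD avoids, and the relative size reduction from 3B to 2.4B parameters. It uses only figures reported in this summary; the helper names are hypothetical.

```python
def gap_closed(sft_delta_pct, kd_delta_pct):
    """Fraction of the SFT AUC drop that KD avoids (both deltas are negative percentages)."""
    return (sft_delta_pct - kd_delta_pct) / sft_delta_pct


def size_reduction(params_before, params_after):
    """Relative reduction in parameter count."""
    return 1.0 - params_after / params_before


frac_8b = gap_closed(-0.62, -0.06)   # 8B student: SFT vs KD
frac_3b = gap_closed(-1.21, -0.15)   # 3B student: SFT vs KD
shrink = size_reduction(3.0, 2.4)    # gradual pruning, in billions of parameters
```

Under these figures, KD avoids roughly 90% of the SFT drop for the 8B student and roughly 88% for the 3B student, and the 3B to 2.4B pruning step corresponds to a 20% reduction in parameters.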

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (KD)
Structured Pruning
Quantization (FP8, INT4)
Mixture-of-Experts (MoE)
Recommendation Systems (Ranking/Retrieval)

Key Terms

Foundation Model (FM): The large, internal 100B+ parameter model used as the teacher/source of knowledge

SLM: Small Language Model—compact models (e.g., 1B-8B parameters) suitable for low-latency serving

OSSCAR: A one-shot structured pruning algorithm used to remove redundant components without immediate retraining

Prefill: The initial phase of LLM inference where the prompt is processed; often the bottleneck in ranking tasks where output length is short

TTFT: Time To First Token—a latency metric measuring how long it takes to start generating the response

SGLang: A high-performance serving engine for LLMs utilized for deployment

SFT: Supervised Fine-Tuning—training on labeled data, used here as a baseline to compare against distillation

IQM: Internal Quality Metric—a proprietary metric used by LinkedIn to evaluate model performance in production