RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems

📝 Paper Summary

Proactive Intent Prediction · Chatbot Recommendations · LLM Alignment for Recommendation

RGAlign-Rec aligns an LLM's semantic reasoning with downstream ranking objectives by using a query-enhanced ranking model as a reward signal for refining the LLM's latent query representations.

Core Problem

In chatbot recommendations, there is a semantic gap between discrete user features and fine-grained intents, and a misalignment between general LLM objectives (fluency) and task-specific ranking goals (CTR).

Why it matters:

Standard recommendation models rely on ID-based collaborative signals, failing to capture the semantic nuance of service issues in zero-query chatbot settings.
General-purpose LLMs produce representations optimized for human readability, which are often sub-optimal for ranking metrics like Click-Through Rate.
Proactive prediction of user intent (e.g., delivery delays) is critical for reducing friction in customer service chatbots handling millions of daily interactions.

Concrete Example: A user has an order stuck in 'To Receive' for 7 days. A standard model sees discrete IDs and might miss the urgency. A general LLM might describe the issue fluently ('The user is waiting') but fail to map it to the specific KB intent 'Speed up parcel delivery' because the embedding isn't aligned with the ranking space.

Key Novelty

Closed-Loop Ranking-Guided Alignment (RGA)

Treats the recommendation ranking model as a Reward Model (RM) to provide feedback to the LLM reasoner, rather than relying on human preference labels.
Introduces a 'Query-Enhanced' three-tower architecture where an LLM synthesizes a latent query from user history, which is then explicitly aligned with item and user towers.
Uses a multi-stage alignment process: first training the ranker, then using the ranker to select best-of-N LLM queries for supervised fine-tuning and contrastive learning.

Architecture

The three-stage RGAlign-Rec framework: (1) training QE-Rec with a frozen LLM; (2) Ranking-Guided Alignment (RGA), where QE-Rec acts as a Reward Model to select the best teacher queries for LLM fine-tuning; and (3) re-training QE-Rec with the aligned LLM.

Evaluation Highlights

+0.12% absolute GAUC improvement on a large-scale Shopee industrial dataset, representing a 3.52% relative reduction in error rate.
+0.98% CTR improvement in online A/B testing from the Query-Enhanced model alone, with an additional +0.13% gain after Ranking-Guided Alignment.
+0.56% improvement in Recall@3 compared to strong baselines like DIN and SASRec.

Breakthrough Assessment

7/10

Solid industrial application of LLM-for-RecSys. The closed-loop alignment using the ranker as a reward model is a practical, effective strategy for domain-specific alignment without expensive human annotation.

⚙️ Technical Details

Problem Definition

Setting: Zero-query intent prediction formulated as a top-K ranking task.

Inputs: Heterogeneous user signals (profiles, order status, dialogue logs) without an explicit user query.

Outputs: A ranked list of K candidate intents from a Knowledge Base.

Pipeline Flow

Feature Verbalization → LLM Reasoner
LLM Reasoner → Query-Enhanced Ranking Model (QE-Rec)
Ranking Model → Reward Signal (Feedback Loop)

System Modules

Feature Verbalizer

Transforms discrete features (IDs, timestamps) into natural language descriptions using metadata dictionaries.

Model or implementation: Rule-based mapping
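
A minimal sketch of rule-based verbalization follows. The feature names (`order_status`, `days_in_status`, `last_viewed_category`), the ID values, and the sentence templates are hypothetical; the paper's actual metadata dictionaries and templates are not given in this summary.

```python
# Hypothetical metadata dictionaries mapping raw IDs to readable names.
ORDER_STATUS = {3: "To Receive", 4: "Completed"}
CATEGORY = {17: "Electronics", 42: "Home & Living"}


def verbalize(features: dict) -> str:
    """Turn discrete features into one natural-language context string."""
    parts = []
    status_id = features.get("order_status")
    if status_id is not None:
        # Fall back to a generic label for IDs missing from the dictionary.
        status = ORDER_STATUS.get(status_id, f"unknown status {status_id}")
        days = features.get("days_in_status")
        if days is not None:
            parts.append(f"The user's latest order has been in '{status}' for {days} days.")
        else:
            parts.append(f"The user's latest order is in '{status}'.")
    category_id = features.get("last_viewed_category")
    if category_id is not None:
        category = CATEGORY.get(category_id, f"category {category_id}")
        parts.append(f"The user recently viewed items in {category}.")
    return " ".join(parts)


context = verbalize({"order_status": 3, "days_in_status": 7})
print(context)  # The user's latest order has been in 'To Receive' for 7 days.
```

The resulting string is the kind of verbalized context the User Query Reasoner would receive as its prompt input.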

User Query Reasoner

Synthesizes a latent query describing user intent from the verbalized context.

Model or implementation: Qwen3-4B (fine-tuned)

Query-Enhanced Ranker (QE-Rec)

Scores candidate intents by combining user-intent signals with query-intent semantic matching.

Model or implementation: Three-tower architecture (User, Intent, Query)
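
The three-tower idea can be illustrated with a toy scorer. This is a sketch under stated assumptions, not the paper's model: each tower is a single random linear projection, and the two matching signals are fused by simple addition. The class name `ThreeTowerScorer`, the dimensions, and the additive fusion are all hypothetical.

```python
import random


def linear(weights, x):
    """Apply a dense projection: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ThreeTowerScorer:
    """Toy scorer: user, intent, and query towers share one output space."""

    def __init__(self, user_dim, intent_dim, llm_dim, out_dim=8, seed=0):
        rng = random.Random(seed)
        self.user_tower = init(out_dim, user_dim, rng)
        self.intent_tower = init(out_dim, intent_dim, rng)
        # The query tower projects the LLM embedding into the ranking space.
        self.query_tower = init(out_dim, llm_dim, rng)

    def score(self, user_feats, llm_query_emb, intent_feats_list):
        u = linear(self.user_tower, user_feats)
        q = linear(self.query_tower, llm_query_emb)
        scores = []
        for intent_feats in intent_feats_list:
            i = linear(self.intent_tower, intent_feats)
            # Assumed fusion: user-intent match plus query-intent match.
            scores.append(dot(u, i) + dot(q, i))
        return scores


scorer = ThreeTowerScorer(user_dim=4, intent_dim=3, llm_dim=6)
scores = scorer.score([1, 0, 0, 1], [0.2] * 6, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# Candidate indices ordered from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
print(ranking)
```

Keeping the query tower separate from the user tower means the LLM embedding can be swapped (frozen LLM first, aligned LLM later) without changing how user and intent features are encoded.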

Novel Architectural Elements

Query-Enhanced Three-Tower Architecture: Adds a specific 'Query Tower' that projects LLM embeddings into the ranking space, alongside standard User and Intent towers.
Closed-Loop Alignment: The ranking model (QE-Rec) is used as the Reward Model to select training samples for the LLM, creating a feedback cycle.

Modeling

Base Model: Qwen3-4B (LLM Reasoner)

Training Method: Ranking-Guided Alignment (RGA) involving SFT and Contrastive Learning

Objective Functions:

Purpose: Optimize the ranking model to order intents correctly.

Formally: Weighted ListNet loss minimizing KL divergence between predicted scores and click distributions.

Purpose: Align LLM embeddings with ranking objectives via Contrastive Learning.

Formally: InfoNCE-style loss maximizing similarity between the generated query embedding and the positive intent embedding while minimizing similarity to negatives.
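
Both objectives can be written out for a single example. The sketch below uses the top-1 form of ListNet and cosine similarity with a temperature for InfoNCE. The scalar `weight`, the `temperature` default, and the function names are assumptions, since the summary does not specify the weighting scheme or similarity function.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, clicks, weight=1.0, eps=1e-12):
    """KL(click distribution || softmax(scores)) over one candidate list."""
    total_clicks = sum(clicks)
    if total_clicks == 0:
        return 0.0  # no click signal in this list
    target = [c / total_clicks for c in clicks]
    pred = softmax(scores)
    kl = sum(t * math.log(t / max(p, eps)) for t, p in zip(target, pred) if t > 0)
    return weight * kl


def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Cross-entropy of picking the positive among positive + negatives."""

    def cosine(a, b):
        # Assumes non-zero vectors.
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den

    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


print(listnet_loss([2.0, 0.5, -1.0], [1, 0, 0]))
print(info_nce_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```

The ranking loss falls as the score distribution concentrates on clicked intents, and the contrastive loss falls as the query embedding moves toward the positive intent and away from the negatives.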

Adaptation: Full fine-tuning of Qwen3-4B using RG-SFT and RG-CL

Training Data:

Teacher Distillation: Uses Gemini-2.5-Pro, GPT-5, and CompassMax to generate diverse latent queries.
Reward Labeling: The pre-trained QE-Rec scores these candidates; the highest-scoring query is selected as the label for SFT.
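
The reward-labeling step amounts to a best-of-N selection over teacher outputs. In the sketch below, `toy_reward` is a stand-in for the trained QE-Rec (word overlap with the clicked intent), and the teacher names and candidate queries are invented for illustration. How the paper derives a scalar reward from QE-Rec is not detailed in this summary.

```python
from typing import Callable, Dict, Tuple


def select_sft_label(
    candidates: Dict[str, str],
    reward_fn: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Return (teacher, query, reward) for the highest-reward candidate."""
    scored = [(reward_fn(q), name, q) for name, q in candidates.items()]
    best_reward, best_name, best_query = max(scored, key=lambda t: t[0])
    return best_name, best_query, best_reward


def toy_reward(clicked_intent: str) -> Callable[[str], float]:
    """Stand-in for QE-Rec: word overlap between query and clicked intent."""
    target = set(clicked_intent.lower().split())

    def reward(query: str) -> float:
        words = set(query.lower().split())
        return len(words & target) / len(target)

    return reward


# Hypothetical candidate queries from three teacher LLMs.
candidates = {
    "teacher_a": "user is waiting for an order",
    "teacher_b": "user wants to speed up parcel delivery",
    "teacher_c": "user asks about a refund",
}
teacher, query, reward = select_sft_label(candidates, toy_reward("Speed up parcel delivery"))
print(teacher, reward)  # teacher_b 1.0
```

The selected query becomes the SFT target for the student LLM, so supervision comes from the ranker's preferences rather than from human labels.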

Key Hyperparameters:

loss_type: ListNet (ranking), InfoNCE (contrastive)
pooling_strategy: Last Token Pooling
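
Last Token Pooling can be sketched on plain Python lists standing in for the LLM's final-layer hidden states. The function name and the explicit attention-mask handling are illustrative assumptions, and the paper's implementation is not shown in this summary.

```python
from typing import List


def last_token_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Return the hidden state of the last non-padding token."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    # Scan from the end so both left- and right-padded sequences work.
    for pos in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[pos] == 1:
            return hidden_states[pos]
    raise ValueError("sequence contains no real tokens")


hidden = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]  # three positions, hidden size 2
mask = [1, 1, 0]  # the final position is padding
print(last_token_pool(hidden, mask))  # [0.3, 0.4]
```

With a causal LLM the last real token has attended to the whole sequence, which is the usual motivation for pooling there rather than averaging all positions.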

Compute: Not reported in the paper

Comparison to Prior Work

vs. SASRec/DIN: RGAlign-Rec explicitly models latent semantic intent via an LLM, addressing the semantic gap in zero-query settings.
vs. HLLM/RecGPT: Instead of using the LLM as a static feature extractor or independent generator, RGAlign-Rec aligns the LLM using feedback from the downstream ranking model.
vs. DPO-Rec [not cited in paper]: DPO-Rec aligns to preferences directly; RGAlign-Rec uses an intermediate ranking model as a proxy reward to guide representation learning.

Limitations

Reliance on a proprietary industrial dataset (Shopee) limits reproducibility.
Computational cost of the three-stage alignment process is likely high, though not explicitly detailed.
The approach assumes the teacher LLMs (Gemini-2.5-Pro, GPT-5, CompassMax) generate high-quality initial queries.

Reproducibility

No replication artifacts mentioned in the paper. Code, data, and model weights are not provided. The dataset is a proprietary industrial dataset from Shopee.

📊 Experiments & Results

Evaluation Setup

Offline evaluation on Shopee industrial dataset and Online A/B testing.

Benchmarks:

Shopee Industrial Dataset (Intent Prediction / Recommendation) [New]

Metrics:

GAUC (Group AUC)
Recall@3
MRR (Mean Reciprocal Rank)
CTR (Click-Through Rate)
CSAT (Customer Satisfaction)

Statistical methodology: Not explicitly reported in the paper
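
For reference, the two list-based offline metrics above (Recall@3 and MRR) can be computed as in the sketch below. It uses the standard textbook definitions; whether the paper averages per session or handles multiple clicks differently is not stated in this summary, and the helper names and toy sessions are hypothetical.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[str], relevant: Sequence[str], k: int = 3) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(ranked[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(ranked: Sequence[str], relevant: Sequence[str]) -> float:
    """1 / position of the first relevant item, or 0 if none is ranked."""
    relevant_set = set(relevant)
    for position, item in enumerate(ranked, start=1):
        if item in relevant_set:
            return 1.0 / position
    return 0.0


def mean_over_sessions(metric, sessions, **kwargs) -> float:
    """Average a per-session metric over (ranking, clicked intents) pairs."""
    return sum(metric(r, rel, **kwargs) for r, rel in sessions) / len(sessions)


# Two toy sessions: (ranked intents, clicked intents).
sessions = [
    (["a", "b", "c", "d"], ["c"]),
    (["a", "b", "c", "d"], ["d"]),
]
print(mean_over_sessions(recall_at_k, sessions, k=3))  # 0.5
print(mean_over_sessions(reciprocal_rank, sessions))   # (1/3 + 1/4) / 2
```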

Key Results

Offline performance comparisons on the Shopee dataset show RGAlign-Rec outperforming traditional deep learning models and other LLM-based approaches.

Benchmark	Metric	Baseline	This Paper	Δ
Shopee Dataset	GAUC	0.8924	0.9022	+0.0098
Shopee Dataset	Recall@3	0.8123	0.8179	+0.0056

Online A/B testing results demonstrate real-world impact on Click-Through Rate. Values are reported as percentage CTR improvement over each row's baseline.

Benchmark	Metric	Baseline	This Paper	Δ
Shopee Production System (QE-Rec alone)	CTR improvement (%)	0.00	0.98	+0.98%
Shopee Production System (additional gain after RGA)	CTR improvement (%)	0.00	0.13	+0.13%

Main Takeaways

Integrating an aligned LLM reasoner significantly improves intent prediction in zero-query scenarios compared to purely ID-based methods.
The 'semantic gap' is effectively bridged by the three-tower QE-Rec architecture.
Ranking-Guided Alignment (RGA) provides additive gains over simply adding an LLM tower, indicating that aligning the semantic space with the ranking space adds value beyond the architecture change alone.
The approach scales to industrial settings, delivering consistent CTR and CSAT improvements.

📚 Prerequisite Knowledge

Prerequisites

Two-tower recommendation architectures
Large Language Models (LLMs) and embeddings
Contrastive Learning
Supervised Fine-Tuning (SFT)

Key Terms

Zero-Query Setting: A scenario where the system must predict user intent based on context before the user types any text.

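GAUC: Group AUC, i.e. the Area Under the ROC Curve computed separately within each group (typically a user or session) and then averaged across groups, usually weighted by group size. It measures ranking quality within a group rather than across the whole population.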
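
A small sketch of GAUC follows, assuming impression-count weighting and skipping groups where AUC is undefined (all clicks or all non-clicks). Both choices are common conventions rather than details confirmed by the paper, and the toy data is invented.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: P(score_pos > score_neg), with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(group_ids, labels, scores):
    """Impression-weighted average of per-group AUC."""
    groups = defaultdict(lambda: ([], []))
    for g, y, s in zip(group_ids, labels, scores):
        groups[g][0].append(y)
        groups[g][1].append(s)
    weighted_sum, total_weight = 0.0, 0
    for ys, ss in groups.values():
        value = auc(ys, ss)
        if value is None:
            continue  # skip groups with only clicks or only non-clicks
        weighted_sum += len(ys) * value
        total_weight += len(ys)
    return weighted_sum / total_weight if total_weight else float("nan")


group_ids = ["u1", "u1", "u1", "u2", "u2", "u3"]
labels = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.2, 0.4, 0.7, 0.3, 0.5]
# u1 has AUC 1.0 (weight 3), u2 has AUC 0.0 (weight 2), u3 is skipped.
print(gauc(group_ids, labels, scores))  # 0.6
```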

LTP: Last Token Pooling—using the hidden state of the final token of an LLM generation as the sequence representation.

CoT: Chain-of-Thought—prompting the LLM to generate intermediate reasoning steps (the latent query) before the final output.

Best-of-N: A sampling strategy where N candidate outputs are generated, and the best one is selected (here, by the ranking model) for training.

DLRM: Deep Learning Recommendation Model, a standard family of industrial ranking architectures built on embedding tables and feature-interaction layers.

CSAT: Customer Satisfaction—a metric measuring how satisfied users are with the service.

ListNet: A learning-to-rank loss function that optimizes the probability distribution of the entire ranked list.

SFT: Supervised Fine-Tuning—updating a pre-trained model on a smaller, labeled dataset.

CTR: Click-Through Rate, the number of clicks on a recommended item divided by the number of times it was shown (impressions).