From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production

📝 Paper Summary

LLM-based Recommendation · Generative Recommender Systems · Prompt Optimization / Verbalization

This paper proposes a framework that learns to translate raw user interaction logs into optimized natural language summaries (verbalizations) by using recommendation accuracy as a reward signal for reinforcement learning.

Core Problem

Current LLM-based recommenders rely on rigid, template-based methods to convert user history into text, which creates parsing overhead, includes noise, and lacks semantic context.

Why it matters:

Template-based concatenation forces LLMs to reason over granular, heterogeneous logs rather than synthesized user preferences
Valuable signal is lost when raw logs are not summarized or enriched with metadata, hurting performance on cold-start items
Standard prompt engineering is insufficient because optimal verbalization depends on specific user history instances

Concrete Example: A template might output '20250608, Monday... Play, Duration: 80.08 min', which is noisy and hard to parse. An optimized verbalizer might rewrite this as 'The user showed strong preference for dark thrillers by binge-watching 5 episodes of Stranger Things', directly exposing the preference signal.
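To make the baseline concrete, here is a minimal sketch of what a fixed-template verbalizer could look like. The `Interaction` fields and their order are assumptions chosen to resemble the example string above; the paper's actual log schema and template are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    # Hypothetical log fields, modeled on the example string above.
    date: str
    weekday: str
    title: str
    action: str
    duration_min: float


def template_verbalize(history: List[Interaction]) -> str:
    """Emit one line per event with every field in a fixed order.

    No aggregation, filtering, or enrichment happens here, which is the
    limitation a learned verbalizer is meant to address.
    """
    return "\n".join(
        f"{e.date}, {e.weekday}, {e.title}, {e.action}, "
        f"Duration: {e.duration_min:.2f} min"
        for e in history
    )


history = [Interaction("20250608", "Monday", "Stranger Things", "Play", 80.08)]
print(template_verbalize(history))
```

Every event costs the same number of tokens regardless of how informative it is, so a long history quickly fills the context with low-signal fields.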

Key Novelty

Two-Stage GRPO Framework for Verbalization and Reasoning

Decomposes recommendation into a 'Verbalizer' (rewrites logs into text) and a 'Reasoner' (predicts next item from text)
Trains the Verbalizer using RL (GRPO) where the reward comes from an Oracle LLM's prediction accuracy on the rewritten text
Subsequently fine-tunes the Reasoner on the distribution of optimized verbalizations

Evaluation Highlights

Achieves 92.9% relative improvement in discovery item recommendation accuracy over template-based baselines on a large industrial dataset
The Verbalizer's learned transformations account for roughly 50 percentage points of the total gain, relative to training the Reasoner on raw templates alone
Rewrite-based verbalization outperforms action-based (filtering/enriching) verbalization by enabling aggregation and summarization strategies

Breakthrough Assessment

8/10

Significant industrial application demonstrating that learning how to represent data (verbalization) is as critical as the reasoning model itself. The huge relative gains (+93%) suggest a major inefficiency in current template-based approaches.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation as a generative reranking task

Inputs: User interaction history H_u (heterogeneous features: timestamp, item ID, duration, etc.) and candidate set C

Outputs: Predicted next item y* from candidate set C

Pipeline Flow

Verbalizer (Rewrites raw history H into text x)
Reasoner (Predicts item y from text x and candidates C)
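The two-step flow above can be sketched as a simple function composition. The type aliases and the toy stand-ins below are illustrative assumptions; in the paper both components are LLMs, not rule-based functions.

```python
from typing import Callable, Dict, Sequence

# Assumed interfaces: the Verbalizer maps raw history H to text x,
# and the Reasoner maps (x, C) to one item y from the candidate set C.
Verbalizer = Callable[[Sequence[Dict[str, str]]], str]
Reasoner = Callable[[str, Sequence[str]], str]


def recommend(history, candidates, verbalizer: Verbalizer, reasoner: Reasoner) -> str:
    """Run Verbalizer then Reasoner and check the output is a valid candidate."""
    x = verbalizer(history)
    y = reasoner(x, candidates)
    if y not in candidates:
        raise ValueError(f"Reasoner returned an item outside the candidate set: {y!r}")
    return y


def toy_verbalizer(history):
    # Stand-in only: lists distinct titles instead of generating a summary.
    titles = sorted({event["title"] for event in history})
    return "The user recently watched: " + ", ".join(titles)


def toy_reasoner(text, candidates):
    # Stand-in only: picks the first candidate not mentioned in the text.
    for candidate in candidates:
        if candidate not in text:
            return candidate
    return candidates[0]


history = [{"title": "Stranger Things"}]
print(recommend(history, ["Stranger Things", "Dark"], toy_verbalizer, toy_reasoner))
```

Keeping the two stages behind separate interfaces is what allows the Verbalizer to be trained against one reasoner (the Oracle) and deployed with another.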

System Modules

Verbalizer

Transform raw interaction logs into optimized natural language summaries

Model or implementation: Qwen-3 (8B or 32B parameters)

Reasoner

Predict the next item the user will engage with

Model or implementation: Causal Language Model (Qwen-3 family)

Novel Architectural Elements

Decoupled Verbalizer-Reasoner architecture where the Verbalizer is an explicit learnable module rather than a fixed template
Use of an external Oracle Reasoner to provide stable reward signals for Verbalizer optimization

Modeling

Base Model: Qwen-3 (8B and 32B variants)

Training Method: Two-stage Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize Verbalizer to produce text that maximizes downstream accuracy.

Formally: Maximize GRPO objective using rewards from Oracle Reasoner prediction accuracy (0 or 1) + length penalty.

Purpose: Optimize Reasoner to predict correctly given verbalized text.

Formally: Maximize GRPO objective using rewards from ground truth engagement (0 or 1).
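The two reward signals described above can be sketched as follows. This assumes the length term is simply added to the binary accuracy reward, as the "+" in the description suggests; the exact weighting used in the paper is not given in this summary, and the function names are hypothetical.

```python
def verbalizer_reward(oracle_prediction: str, target: str, length_term: float) -> float:
    """Stage 1 reward (assumed additive form).

    accuracy is 1.0 if the Oracle Reasoner, reading the verbalized text,
    picks the item the user actually engaged with, else 0.0.
    length_term is whatever length penalty or bonus is applied to the text.
    """
    accuracy = 1.0 if oracle_prediction == target else 0.0
    return accuracy + length_term


def reasoner_reward(prediction: str, target: str) -> float:
    """Stage 2 reward: 1.0 on a match with ground-truth engagement, else 0.0."""
    return 1.0 if prediction == target else 0.0


print(verbalizer_reward("Dark", "Dark", length_term=-0.25))
print(reasoner_reward("Dark", "Ozark"))
```

Note that the Verbalizer never sees the target directly: its only feedback is whether its text led a fixed reader to the right answer.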

Key Hyperparameters:

length_reward_alpha: 0.9
length_target_range: 0.3-0.7 (compression ratio)
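One way to read these two hyperparameters is as a plateau-shaped length reward over the compression ratio. The sketch below is an assumption throughout: the summary does not say how `length_reward_alpha` enters the reward, so it is used here as the slope of a linear fall-off, and the compression ratio is assumed to be verbalized length divided by raw length in characters.

```python
def compression_ratio(raw_text: str, verbalized_text: str) -> float:
    """Assumed definition: character length of the rewrite over the raw input."""
    if not raw_text:
        raise ValueError("raw_text must be non-empty")
    return len(verbalized_text) / len(raw_text)


def plateau_length_reward(ratio: float, low: float = 0.3, high: float = 0.7,
                          alpha: float = 0.9) -> float:
    """Return 1.0 inside [low, high]; fall off linearly outside, floored at 0.

    alpha as a slope is a hypothetical choice, not the paper's stated formula.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if low <= ratio <= high:
        return 1.0
    distance = (low - ratio) if ratio < low else (ratio - high)
    return max(0.0, 1.0 - alpha * distance)


for r in (0.1, 0.3, 0.5, 0.7, 1.0):
    print(r, plateau_length_reward(r))
```

A flat top avoids pushing the Verbalizer toward one exact length, while the fall-off discourages both verbatim copying and over-compression.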

Compute: Not reported in the paper

Comparison to Prior Work

vs. P5: P5 uses fixed templates; this work learns the template/verbalization function dynamically per user.
vs. Prompt Tuning: This work generates discrete, human-readable natural language verbalizations rather than continuous soft embeddings.
vs. In-Context Learning (Liu et al., 2023b): Learns to construct the context itself rather than just utilizing provided context [not cited in paper].

Limitations

Relies on a powerful Oracle Reasoner for training, which may be computationally expensive or inaccessible.
Evaluated on a single proprietary dataset (industrial streaming), limiting reproducibility and generalization checks.
No direct analysis of latency costs introduced by the generation step of the Verbalizer in a production setting.

Reproducibility

Code availability is not provided. Dataset is proprietary industrial streaming data (Netflix). Baselines (Template, Zero-Shot) are reproducible concepts but exact implementation depends on data schema.

📊 Experiments & Results

Evaluation Setup

Reranking task: predict next engaged item from 10 candidates given user history (up to 100 interactions).

Benchmarks:

Industrial Streaming Dataset (sequential recommendation, reranking) [New]

Metrics:

Recall@1 for Discovery (identifying new content matching preferences)
Statistical methodology: Not explicitly reported in the paper
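For reference, the metric and the relative improvement reported below can be computed as in this sketch. The discovery filter (target not in the user's previously watched set) follows the definition in Key Terms; the function names and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def discovery_recall_at_1(predictions, targets, watched_sets):
    """Fraction of discovery examples where the top prediction is the target.

    An example counts as discovery when its target is not in the user's
    previously watched set.
    """
    if not (len(predictions) == len(targets) == len(watched_sets)):
        raise ValueError("inputs must have the same length")
    pairs = [(p, t) for p, t, w in zip(predictions, targets, watched_sets) if t not in w]
    if not pairs:
        raise ValueError("no discovery examples")
    return sum(p == t for p, t in pairs) / len(pairs)


def relative_improvement_pct(model_score: float, baseline_score: float) -> float:
    """Relative gain over the baseline, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * (model_score - baseline_score) / baseline_score


# Toy data: the third example is excluded because "c" was already watched.
preds = ["a", "b", "c"]
targets = ["a", "x", "c"]
watched = [set(), set(), {"c"}]
print(discovery_recall_at_1(preds, targets, watched))
print(relative_improvement_pct(0.5, 0.25))
```

Because results are reported as relative improvements, the absolute Recall@1 values cannot be recovered from the table alone.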

Key Results

Main comparison showing the impact of learned verbalization and the full pipeline against baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Industrial Streaming Dataset	Relative Recall@1 Improvement (%)	0.0	5.3	+5.3
Industrial Streaming Dataset	Relative Recall@1 Improvement (%)	0.0	12.5	+12.5
Industrial Streaming Dataset	Relative Recall@1 Improvement (%)	0.0	92.9	+92.9
Industrial Streaming Dataset	Relative Recall@1 Improvement (%)	0.0	42.8	+42.8

Main Takeaways

Verbalization is not just preprocessing; it significantly impacts model reasoning capabilities.
Rewrite-based verbalization enables emergent strategies like summarizing user interests and filtering noise, outperforming simple action-based filtering.
The synergy between Verbalizer and Reasoner is strong: the Verbalizer creates a more 'learnable' distribution for the Reasoner, roughly doubling the performance gain compared to training the Reasoner alone.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF) concepts
Generative Recommender Systems
LLM prompting strategies

Key Terms

verbalization: The process of converting structured data (like user logs) into natural language text for LLM input

GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs, avoiding the need for a separate value function
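The group normalization in that definition can be illustrated with a few lines. This is a sketch of the advantage computation only, not a full GRPO training step; the use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group.

    All rewards are assumed to come from outputs sampled for the same prompt.
    eps guards against division by zero when every output gets the same reward.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four verbalizations sampled for one user; two led the Oracle to the right item.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every sample gets the same reward, all advantages are zero.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

With binary rewards, a group where every sample succeeds or every sample fails yields zero advantage, so only groups with mixed outcomes produce a learning signal.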

Recall@1 for Discovery: A metric measuring how often the model correctly predicts a relevant item that the user has not previously watched

Oracle Reasoner: A powerful, fixed LLM used during training to provide reward signals to the Verbalizer, ensuring the Verbalizer learns robust representations

plateau function: A reward function component that keeps values high within a specific range (e.g., length ratio 0.3-0.7) and penalizes values outside it

cold-start items: Items with little or no historical interaction data, making them difficult for traditional collaborative filtering to recommend

heterogeneous feature space: Data containing mixed types of information (time, text, categorical IDs, continuous numbers) that can be difficult for models to process uniformly