REST-PG: Reasoning-enhanced self-training for long-form personalized text generation

📝 Paper Summary

User-profile based personalization; Post-training (RL/Self-training)

REST-PG improves personalized text generation by training models to explicitly reason about user preferences and background before answering, using reinforced self-training to discover high-quality reasoning paths.

Core Problem

Personalization often requires using context that seems irrelevant to the prompt but reveals implicit preferences (e.g., mentioning children implies a safety preference), which standard LLMs fail to reason over effectively.

Why it matters:

Existing retrieval methods struggle with 'implicit relevance'—context that isn't semantically similar to the prompt but is crucial for personalization
Human-annotated data for 'personalized reasoning' is scarce and costly, making supervised training difficult
Standard supervised fine-tuning on synthetic reasoning data is insufficient because the model's initial reasoning may not actually align with user preferences

Concrete Example: If a user profile says 'I have two children age 3 and 4' and the user asks 'Suggest room heater brands', a standard model might ignore the children. A personalized model should reason: 'User has young kids -> Safety is a priority -> Suggest heaters with child-safe features.'

Key Novelty

Reasoning-Enhanced Self-Training (REST-PG)

Generates synthetic 'reasoning paths' (summaries of user style/preferences) using an LLM to bridge the gap between user profile and expected output
Uses Expectation-Maximization Reinforced Self-Training to let the model explore different reasoning paths and iteratively train on the ones that produce responses most similar to the user's ground truth
Treats reasoning as a latent variable optimized via reinforcement learning without needing human-labeled reasoning steps

Architecture

Overview of the REST-PG optimization framework including the Expectation and Maximization steps.

Evaluation Highlights

+14.5% average relative performance gain across LongLaMP benchmark tasks compared to Supervised Fine-Tuning (SFT) baselines
Outperforms self-training without reasoning enhancement by 6.5% on average, indicating that the reasoning step itself contributes to the gain
Outperforms the compared baselines on all four LongLaMP tasks: Email Completion, Abstract Generation, Review Writing, and Topic Writing

Breakthrough Assessment

7/10

Strong methodological contribution that combines reasoning generation with reinforced self-training for personalization. Significant empirical gains on established benchmarks, though the method relies on synthetic reasoning data and an LLM-based evaluator.

⚙️ Technical Details

Problem Definition

Setting: Personalized text generation given a user profile

Inputs: Input prompt x, User profile P_u (set of unstructured documents)

Outputs: Generated response y_hat maximizing reward R(x, y, y_hat) against expected output y

Pipeline Flow

Input Processing (Combine User Profile + Prompt)
Reasoning Generation (Generate user preference summary)
Response Generation (Generate final answer based on reasoning)
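The three steps above can be sketched as a single generation pass. This is a minimal illustration, not the paper's implementation: the prompt template, the "Reasoning:" / "Response:" output markers, and the `toy_generate` stand-in for the fine-tuned model are all assumptions made for this example.

```python
from typing import Callable, Dict, List


def build_input(prompt: str, profile_docs: List[str]) -> str:
    """Step 1: concatenate the user profile documents with the prompt."""
    profile = "\n".join(f"- {doc}" for doc in profile_docs)
    return f"User profile:\n{profile}\n\nPrompt: {prompt}"


def personalize(prompt: str, profile_docs: List[str],
                generate: Callable[[str], str]) -> Dict[str, str]:
    """Steps 2 and 3: one generation pass yields reasoning, then the response."""
    raw = generate(build_input(prompt, profile_docs))
    # Assumed output format: "Reasoning: ...\nResponse: ..."
    reasoning, _, response = raw.partition("\nResponse:")
    return {
        "reasoning": reasoning.removeprefix("Reasoning:").strip(),
        "response": response.strip(),
    }


def toy_generate(model_input: str) -> str:
    """Hypothetical stand-in for the fine-tuned model, for illustration only."""
    if "children" in model_input:
        return ("Reasoning: User has young kids, so safety is a priority.\n"
                "Response: Suggest heaters with child-safe features.")
    return "Reasoning: No clear preferences found.\nResponse: Suggest popular heaters."


result = personalize("Suggest room heater brands",
                     ["I have two children age 3 and 4"], toy_generate)
print(result["reasoning"])
print(result["response"])
```

Swapping `toy_generate` for a call to a real model keeps the rest of the flow unchanged, since the reasoning and the response come out of the same decoded string.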

System Modules

Input Processor

Concatenates the user prompt with the retrieved/relevant user profile documents

Model or implementation: Script/Rule-based

Reasoning Generator

Generates a reasoning path (summary of preferences, style, background) based on the context

Model or implementation: Fine-tuned LLM (Gemma 7B)

Response Generator

Generates the final personalized response conditioned on the input and the generated reasoning path

Model or implementation: Fine-tuned LLM (Gemma 7B)

Novel Architectural Elements

Single-pass inference generating explicit reasoning followed by response
Iterative Expectation-Maximization loop where the 'expected output' includes the generated reasoning path that led to a high-reward final answer

Modeling

Base Model: Gemma 7B

Training Method: Expectation-Maximization Reinforced Self-Training (ReST-EM)

Objective Functions:

Purpose: Maximize expected reward of generated sequences.

Formally: Iterative EM where E-step collects high-reward samples (reward > tau) and M-step minimizes weighted Seq2Seq loss.
Purpose: Reward function for filtering.

Formally: R(x, y, y_hat) based on similarity between generated output y_hat and ground truth y (using an LLM-based evaluator).
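One way to write the two steps described above, as a sketch rather than the paper's exact formulation. The symbols z (the generated reasoning path), t (the iteration index), and w (the per-sample weight in the weighted Seq2Seq loss) are notation introduced here; the summary does not specify how w is defined.

```latex
\begin{align}
\text{E-step:}\quad & \mathcal{D}_t = \bigl\{ (x, P_u, z, \hat{y}) : (z, \hat{y}) \sim p_{\theta_t}(\cdot \mid x, P_u),\; R(x, y, \hat{y}) > \tau \bigr\} \\
\text{M-step:}\quad & \theta_{t+1} = \arg\min_{\theta} \; -\sum_{(x, P_u, z, \hat{y}) \in \mathcal{D}_t} w(x, y, \hat{y}) \, \log p_{\theta}(z, \hat{y} \mid x, P_u)
\end{align}
```

The reward only inspects the final response, yet the training target contains both z and the response, which is how the reasoning path is optimized as a latent variable without human-labeled reasoning.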

Adaptation: Full fine-tuning

Training Data:

Step 1: Use LLM to generate initial reasoning paths (silver data) for SFT.
Step 2 (Iterative): Generate M outputs per input with temperature gamma. Keep outputs with Reward > tau (max 10 per input).
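The filtering in Step 2 can be sketched as follows. Several pieces are assumptions made for illustration: `token_overlap` is a toy word-overlap reward standing in for the LLM-based evaluator, `sample` is any callable that draws one output at the chosen temperature, and keeping the highest-reward outputs when more than `max_keep` pass the threshold is one possible policy that the summary does not specify.

```python
from itertools import cycle
from typing import Callable, List, Tuple


def token_overlap(output: str, truth: str) -> float:
    """Toy reward: Jaccard overlap of word sets (stand-in for the LLM evaluator)."""
    a, b = set(output.lower().split()), set(truth.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def e_step(examples: List[Tuple[str, str]],
           sample: Callable[[str], str],
           reward: Callable[[str, str], float],
           num_samples: int,
           tau: float,
           max_keep: int = 10) -> List[Tuple[str, str]]:
    """Collect (input, output) pairs whose reward exceeds tau."""
    kept: List[Tuple[str, str]] = []
    for model_input, truth in examples:
        # Draw M candidate outputs and score each against the ground truth.
        scored = []
        for _ in range(num_samples):
            output = sample(model_input)
            scored.append((reward(output, truth), output))
        passing = [(r, o) for r, o in scored if r > tau]
        # Assumed policy: prefer the highest-reward outputs, capped per input.
        passing.sort(key=lambda pair: pair[0], reverse=True)
        kept.extend((model_input, o) for _, o in passing[:max_keep])
    return kept


# Deterministic toy sampler that cycles through three canned outputs.
_pool = cycle(["safe heater for kids", "cheap heater", "safe heater for kids please"])
train_set = e_step(
    examples=[("profile + prompt", "safe heater for kids")],
    sample=lambda _input: next(_pool),
    reward=token_overlap,
    num_samples=6, tau=0.5, max_keep=3,
)
print(train_set)
```

The returned pairs would then serve as the fine-tuning data for the next Maximization step. This sketch keeps duplicate outputs; whether duplicates are removed is not stated in the summary.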

Key Hyperparameters:

decoding_temperature_gamma: Not explicitly reported in the paper
reward_threshold_tau: Not explicitly reported in the paper
max_outputs_retained: 10
LLM_for_reasoning_generation: Gemma 7B
LLM_for_evaluation: Gemma 7B

Compute: Not reported in the paper

Comparison to Prior Work

vs. SFT: REST-PG adds an intermediate reasoning step and uses RL to optimize it
vs. Standard Self-Training: REST-PG explicitly learns to generate reasoning paths, not just final outputs
vs. CoT: REST-PG trains the model to generate user-specific reasoning (preferences/style) rather than just logical deduction, and optimizes it via RL
vs. STaR (Self-Taught Reasoner) [not cited in paper]: Similar EM loop, but REST-PG focuses on personalization context (implicit preferences) rather than logical QA correctness

Limitations

Relies on an LLM-based reward model (Gemma 7B) rather than human preference, which may introduce bias
Computationally expensive due to the iterative generation (M samples per input) in the Expectation step
Evaluated only on LongLaMP benchmark; generalization to other personalization tasks is untested
No cost analysis of the multi-step generation vs. standard generation provided

Reproducibility

Code availability is not provided. The paper relies on Gemma 7B for both generation and evaluation. Prompts for reasoning generation and evaluation are provided in Appendices A and C.

📊 Experiments & Results

Evaluation Setup

Personalized long-form text generation using user profiles

Benchmarks:

LongLaMP (personalized text generation: Email, Abstract, Review, Topic)

Metrics:

LLM-based Evaluator Score (0-1 normalized)
ROUGE (mentioned as standard but LLM evaluator preferred)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on LongLaMP benchmark tasks (LLM-evaluator metric, 0-1 scale) showing REST-PG outperforms baselines.
Benchmark	Metric	Baseline	This Paper	Δ
LongLaMP (Average)	LLM Score	0.3976	0.4554	+0.0578
LongLaMP (Average)	LLM Score	0.3905	0.4554	+0.0649
LongLaMP (Average)	LLM Score	0.4274	0.4554	+0.0280
LongLaMP-3 (Review Writing)	LLM Score	0.6525	0.7077	+0.0552
LongLaMP-4 (Topic Writing)	LLM Score	0.2270	0.3238	+0.0968
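The table reports absolute differences, while the Evaluation Highlights quote relative gains. The short script below converts one into the other using only the numbers in the table. The rows do not name their baselines; the first and third average rows work out to roughly 14.5% and 6.5%, which suggests they correspond to the SFT and self-training baselines respectively, but that mapping is an inference and is not stated in the table.

```python
# (benchmark, baseline score, REST-PG score), copied from the table above
rows = [
    ("LongLaMP (Average)", 0.3976, 0.4554),
    ("LongLaMP (Average)", 0.3905, 0.4554),
    ("LongLaMP (Average)", 0.4274, 0.4554),
    ("LongLaMP-3 (Review Writing)", 0.6525, 0.7077),
    ("LongLaMP-4 (Topic Writing)", 0.2270, 0.3238),
]


def relative_gain(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline


for name, baseline, ours in rows:
    print(f"{name}: +{ours - baseline:.4f} absolute, "
          f"+{relative_gain(baseline, ours):.1%} relative")
```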

Main Takeaways

REST-PG consistently outperforms SFT and standard Self-Training across all four personalization tasks.
Adding reasoning without reinforcement learning (SFT w/ Reasoning-Enhancement) performs worse than standard SFT, suggesting that the initially generated reasoning paths are often low quality.
The exploration step in RL is crucial: it allows the model to find reasoning paths that actually align with the ground truth, correcting the initial low-quality reasoning.
The method is effective for capturing implicit relevance (e.g., user style or background) that standard retrieval or context augmentation might miss.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Self-Training / Expectation-Maximization
Large Language Models (LLMs)
Retrieval-Augmented Generation (RAG)

Key Terms

REST-PG: Reasoning-Enhanced Self-Training for Personalized Text Generation—the proposed framework

SFT: Supervised Fine-Tuning—training a model on labeled examples

LLM: Large Language Model—a deep learning model trained on vast amounts of text

EM: Expectation-Maximization—an iterative method to find maximum likelihood estimates, used here to alternate between generating data (E-step) and training on it (M-step)

LongLaMP: Long-form Language Model Personalization benchmark—a dataset for evaluating personalized text generation

RL: Reinforcement Learning—training models to make sequences of decisions to maximize a reward

Reasoning Path: Intermediate text generated by the model explicitly analyzing user preferences/style before generating the final answer

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics for evaluating automatic summarization and translation

Gemma: A family of open-weight LLMs developed by Google DeepMind

Seq2Seq Loss: Sequence-to-Sequence Loss—typically cross-entropy loss used to train models to map input sequences to output sequences