User Review Writing via Interview with Dialogue Systems

📝 Paper Summary

Conversational personalization, Agentic AI

A dialogue system acts as an interviewer to elicit detailed product feedback from users, then automatically generates a structured review and predicts a rating based on the conversation history.

Core Problem

Writing high-quality, detailed user reviews is time-consuming and labor-intensive for humans, while existing automated generation methods lack sufficient subjective details to be truly personalized.

Why it matters:

Detailed reviews are crucial for other buyers' decision-making and provide valuable feedback for sellers to improve product quality
Existing automated methods rely on limited inputs (e.g., just ratings or images) and struggle to incorporate the user's specific personal experiences without direct input
Reducing the burden of writing encourages more users to share valuable feedback that might otherwise remain unwritten

Concrete Example: A user wants to review an electric shaver but finds writing a full paragraph tedious. Without this system, they might leave only a star rating. With it, they chat briefly about 'small hair issues', and the system generates a full review: '...well satisfied but... some times small hair from the beard gets stucks'.

Key Novelty

Interactive Interview-to-Review Generation

Replaces the unidirectional writing process with a bidirectional interview where a dialogue agent actively asks follow-up questions to elicit specific pros/cons
Transforms the resulting conversational history into a non-conversational review format using a generative model, rather than just summarizing it
Decouples the rating process by predicting a score based on the generated text's sentiment, aiming to reduce subjective bias in manual rating assignment

Architecture

The sequential pipeline of the proposed review generation system.

Evaluation Highlights

Review readers rated system-generated reviews as more helpful than human-written reviews (55% win rate vs. 23% for human)
System-generated reviews required less editing for user satisfaction compared to a baseline with fixed questions (only 27% of users needed >50% rewriting vs. 38% for baseline)
Users rated the interaction with the interview system as significantly more 'fun' than the interaction with a static baseline questionnaire system (Mann–Whitney U test, p < 0.05)

Breakthrough Assessment

6/10

A novel application of LLMs for interactive content creation. While the underlying tech (GPT-4) is standard, the interview-based workflow for eliciting detailed structured data is a practical UX innovation.

⚙️ Technical Details

Problem Definition

Setting: Interactive text generation where a system elicits information $I$ from user $U$ via dialogue $D$ to generate review $R$ and rating $S$.
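
One way to write this setting down is sketched below. The turn index $t$, the number of turns $T$, and the function names $f$, $g$, and $h$ are notation introduced here for illustration and do not come from the paper; the information $I$ is carried by the user's answers $a_t$.

```latex
\begin{aligned}
q_t &= f\big(q_{1:t-1},\, a_{1:t-1}\big) && \text{(system asks the next question given the history)} \\
D   &= \big((q_1, a_1), \dots, (q_T, a_T)\big) && \text{(dialogue after } T \text{ turns)} \\
R   &= g(D) && \text{(review generated from the dialogue)} \\
S   &= h(R), \quad S \in \{1, \dots, 5\} && \text{(rating predicted from the review text)}
\end{aligned}
```

Note that $S$ depends on $D$ only through $R$, which matches the summary's description of a rating predicted from the generated review rather than from the raw conversation.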

Inputs: User responses to system-generated interview questions

Outputs: A finalized user review text and a predicted star rating (1-5)

Pipeline Flow

Interview Dialogue System (conducts interview)
Review Text Generator (converts dialogue to review)
Rating Predictor (predicts score from review)

System Modules

Interview Dialogue System

Act as an interviewer to elicit user opinions, asking follow-up questions or changing topics

Model or implementation: gpt-4-0613

Review Text Generator

Convert the dialogue history into a coherent, non-conversational review text written from the user's perspective

Model or implementation: gpt-4-0613

Rating Predictor

Predict a consistent star rating (1-5) based on the generated review text

Model or implementation: gpt-4-0613
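
The three modules can be chained as in the minimal sketch below. It assumes a generic `LLM` callable (prompt in, text out); the prompt strings, function names, `answer_fn`, and `stub_llm` are all hypothetical stand-ins, since the summary does not give the paper's actual prompts or API calls.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]   # hypothetical: maps a prompt to a completion
Turn = Tuple[str, str]       # (interviewer question, user answer)


def format_history(history: List[Turn]) -> str:
    """Render the dialogue so far as plain text for inclusion in a prompt."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in history)


def run_interview(llm: LLM, product: str, answer_fn: Callable[[str], str],
                  max_turns: int = 3) -> List[Turn]:
    """Stage 1: ask one question per turn, conditioned on the history so far."""
    history: List[Turn] = []
    for _ in range(max_turns):
        prompt = f"INTERVIEW about {product}\n{format_history(history)}\nNext question:"
        question = llm(prompt)
        history.append((question, answer_fn(question)))
    return history


def generate_review(llm: LLM, history: List[Turn]) -> str:
    """Stage 2: turn the dialogue into a first-person, non-conversational review."""
    return llm(f"REVIEW from this interview:\n{format_history(history)}")


def predict_rating(llm: LLM, review: str) -> int:
    """Stage 3: predict a 1-5 star rating from the review text alone."""
    raw = llm(f"RATING (1-5) for this review:\n{review}")
    return min(5, max(1, int(raw.strip())))  # clamp to the valid star range


def pipeline(llm: LLM, product: str, answer_fn: Callable[[str], str],
             max_turns: int = 3) -> Tuple[str, int]:
    """Run the three stages in sequence and return (review, rating)."""
    history = run_interview(llm, product, answer_fn, max_turns)
    review = generate_review(llm, history)
    return review, predict_rating(llm, review)


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a real model call, so the sketch runs as-is."""
    if prompt.startswith("INTERVIEW"):
        return "What did you like or dislike about it?"
    if prompt.startswith("REVIEW"):
        return "I am satisfied overall, though small hairs sometimes get stuck."
    return "4"


if __name__ == "__main__":
    review, rating = pipeline(stub_llm, "electric shaver",
                              lambda q: "Small hairs get stuck.", max_turns=2)
    print(review, rating)
```

Swapping `stub_llm` for a function that calls a real model would leave the control flow unchanged; the rating stage only ever sees the review text, not the dialogue.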

Novel Architectural Elements

Integration of an active interviewing module that dynamically explores ambiguous user answers to feed a downstream generation module
Separation of review text generation and rating prediction to ensure the rating objectively reflects the text content

Modeling

Base Model: GPT-4 (gpt-4-0613)

📊 Experiments & Results

Evaluation Setup

Human evaluation via Amazon Mechanical Turk involving both system users (who chatted with the bot) and third-party readers (who judged the reviews).

Benchmarks:

User Satisfaction Survey (Human evaluation of system interaction) [New]
Reader Helpfulness Evaluation (Pairwise comparison of generated vs. human reviews) [New]

Metrics:

User rating of 'Fun'
User rating of 'Burden'
Amount of rewriting needed
Reader preference (Helpful, Pros/Cons, Comprehensive)
Statistical methodology: Mann–Whitney U test used for user enjoyment comparison (p < 0.05).
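
For readers unfamiliar with the test, the sketch below computes a Mann-Whitney U statistic and a two-sided p-value for two groups of Likert ratings. It is a plain-Python illustration, not the paper's analysis code: it uses the normal approximation with a tie correction and no continuity correction (an exact test is preferable for very small samples), and the sample ratings are invented for demonstration.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign average ranks to tied values and accumulate the tie correction.
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all observations identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


if __name__ == "__main__":
    # Invented 1-5 'fun' ratings, for illustration only.
    proposed = [5, 5, 4, 5, 4]
    baseline = [3, 2, 3, 4, 2]
    u, p = mann_whitney_u(proposed, baseline)
    print(f"U = {u}, p = {p:.4f}")
```

The test is rank-based, which is why it suits ordinal Likert responses: it compares the two groups without assuming the ratings are normally distributed or evenly spaced.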

Key Results

Reader evaluations show the proposed system generates more helpful and comprehensive reviews than human writers or a fixed-question baseline.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Review Helpfulness	Win Rate vs Human	23	55	+32
Review Comprehensiveness	Win Rate vs Human	23	60	+37
Balanced Pros/Cons	Win Rate vs Human	16	63	+47
User experience metrics reveal a trade-off: the dynamic system is more fun and produces better drafts, but is perceived as more burdensome due to latency.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Editing Effort	% of users needing >50% rewrite	38	27	-11
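
The Δ column is the "This Paper" value minus the "Baseline" value, in percentage points. The short check below recomputes it from the reported numbers; the variable names are illustrative only.

```python
# (benchmark, baseline %, this paper %) as reported in the results table
rows = [
    ("Review Helpfulness", 23, 55),
    ("Review Comprehensiveness", 23, 60),
    ("Balanced Pros/Cons", 16, 63),
    ("Editing Effort", 38, 27),
]

# Difference in percentage points; negative is better for Editing Effort,
# since it counts users who needed to rewrite more than half of the draft.
deltas = {name: ours - baseline for name, baseline, ours in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d} pp")
```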

Experiment Figures

Distribution of participant responses to survey questions (Likert scale) comparing the proposed system vs. baseline.

Pie charts showing how much users felt they needed to rewrite the generated review.

Main Takeaways

Dynamic interviewing elicits more comprehensive information than fixed questionnaires, leading to reviews that readers find more helpful.
Users find the interactive chat more 'fun' than filling out forms, but the latency of GPT-4 creates a perception of higher burden.
Automated rating prediction based on the generated text aligns better with objective third-party assessments than with the users' own subjective ratings.
System-generated reviews are consistently rated as more balanced between pros and cons than human-written reviews.

📚 Prerequisite Knowledge

Prerequisites

Basics of dialogue systems and prompt engineering
Understanding of text summarization vs. style transfer
Familiarity with Likert scale evaluations

Key Terms

chain-of-thought prompting: A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer to improve logic and accuracy

ROUGE-L: A metric for evaluating text generation by measuring the longest common subsequence between the generated text and a reference text
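
A minimal sketch of the ROUGE-L computation follows, assuming lowercased whitespace tokenization and the standard LCS-based F-measure. Reference implementations apply their own tokenization and stemming, so their scores may differ slightly; the example sentences are invented.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score; beta > 1 weights recall more heavily than precision."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    generated = "the shaver works well"
    reference = "the shaver works very well"
    print(round(rouge_l(generated, reference), 4))
```

In this example the longest common subsequence has 4 tokens, so precision is 4/4, recall is 4/5, and the F1 score is 8/9.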

Likert scale: A psychometric scale commonly used in questionnaires (e.g., 1 to 5) to measure people's attitudes or opinions

baseline system: In this paper, a system that asks a fixed sequence of pre-defined questions rather than dynamically generating follow-up questions

temperature: A hyperparameter in language models that controls randomness; lower values (e.g., 0) make outputs more deterministic/focused, while higher values make them more diverse
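
The effect of temperature can be illustrated with a softmax over a few token scores. This is a generic sketch, not the paper's configuration: the logits are made up, and a temperature of exactly 0 is rejected here because it would divide by zero (in practice it is usually treated as greedy, always-pick-the-top-token decoding).

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
    for t in (0.1, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate almost all probability on the highest-scoring token, while higher temperatures flatten the distribution and make sampled outputs more varied.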

MTurk: Amazon Mechanical Turk—a crowdsourcing marketplace used here to recruit human participants for experiments