ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation

📝 Paper Summary

Agentic AI E-commerce Search Synthetic Data Generation

ProductResearch improves e-commerce shopping agents by training them on synthetic trajectories generated by a multi-agent system where supervisory feedback is distilled into the agent's own reflective reasoning.

Core Problem

Existing e-commerce agents lack the depth for complex research, while open-domain 'Deep Research' agents struggle with domain gaps like mixing web search with strict product catalog queries.

Why it matters:

Modern users need comprehensive analysis (e.g., comparing technical specs for professional gear), not just simple item retrieval or binary recommendations
Open-domain research models hallucinate or fail to use specialized e-commerce tools effectively
ReAct-style agents often prioritize short-term task completion over the evidentiary rigor required for high-stakes purchasing decisions

Concrete Example: When a user asks for 'a professional camera system for specific environmental conditions,' a standard Deep Research agent might hallucinate features or fail to check inventory. The proposed system ensures the agent verifies claims against the product catalog and synthesizes expert reviews into a structured report.

Key Novelty

Multi-Agent Synthetic Trajectory Distillation

Employs a Supervisor Agent with a state machine to critique the Research Agent at every step (Plan, Tool Use, Report), ensuring logic and coverage
Uses 'Reflective Internalization' to convert multi-turn supervisor-worker arguments into coherent single-turn 'thought' traces, allowing standard models to learn from the supervision without needing a supervisor at inference time
User Agent generates dynamic, query-specific evaluation rubrics (weights for depth, readability, etc.) to guide the generation process

Architecture

The ProductResearch framework workflow: User Profiling -> Supervised Research Execution -> Distillation

Evaluation Highlights

Fine-tuned Qwen3-30B-A3B achieves a RACE score of 45.40, outperforming its base model (31.78) and the open-source Tongyi-DeepResearch (29.84)
The model achieves an Effective Product Count of 12.45, more than tripling the base model's 3.58, indicating significantly broader product coverage
Performance matches proprietary frontier systems like Gemini-DeepResearch (45.56) on the e-commerce research benchmark

Breakthrough Assessment

8/10

Successfully adapts the 'Deep Research' paradigm to the constrained e-commerce domain using a clever distillation method that internalizes supervision, enabling small models to match proprietary giants.

⚙️ Technical Details

Problem Definition

Setting: Generating comprehensive e-commerce product research reports based on complex user queries

Inputs: User's behavioral history (purchases, reviews) and a specific complex research query

Outputs: A detailed, evidence-grounded product research report synthesizing open web info and catalog data

Pipeline Flow

Phase 1: User Agent (Intent & Rubric Generation)
Phase 2: Supervisor-Guided Research Loop (Plan -> Tool -> Report)
Phase 3: Trajectory Distillation (Reflective Internalization)

System Modules

User Agent (Data Generation)

Simulates user persona from history, generates complex query Q, and creates dynamic evaluation criteria (weights for comprehensiveness, depth, etc.)

Model or implementation: Not explicitly named (likely a strong proprietary model)

Research Agent (Data Generation)

Executes the research task via iterative reasoning and tool usage

Model or implementation: LLM in ReAct loop

Supervisor Agent (Data Generation)

Monitors every step of the Research Agent using a state machine

Model or implementation: LLM with state-specific prompts

Novel Architectural Elements

Reflective Internalization mechanism: Distills [Assistant -> Supervisor Reject -> Assistant Revise] sequences into a single [Assistant (Self-Correction)] turn for SFT
Dynamic Evaluation Criteria Generation: The User Agent creates a query-specific grading rubric that the Supervisor uses to critique the Research Agent

Modeling

Base Model: Qwen3-30B-A3B (a compact Mixture-of-Experts model)

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

1,000 representative user personas curated from logs
Dataset partitioned 8:1:1 (train/val/test)
Trajectories filtered by length (minimum turn threshold)

Key Hyperparameters:

context_length_variants: 32k to 128k tokens

Compute: 32x A100 GPU cluster using Megatron-LM

Comparison to Prior Work

vs. Deep Research (Tongyi): ProductResearch integrates specific product catalog tools and domain constraints, avoiding the tool-use generalization issues of open-domain models
vs. Standard ReAct: ProductResearch uses long-horizon synthetic trajectories with internalized self-correction, enabling deeper analysis than simple ReAct loops
vs. STaR [not cited in paper]: Similar to Self-Taught Reasoner but explicitly uses a distinct Supervisor Agent with a dynamic rubric rather than binary success/failure signals

Limitations

Reliance on a strong Supervisor Agent for data synthesis; the quality of the student is upper-bounded by the supervisor
Computational cost of generating long-horizon multi-agent trajectories for training data
Domain specificity to e-commerce; transferability to other vertical domains (e.g., medical, legal) is not explicitly tested

Reproducibility

Code and trained model weights are not provided. The dataset construction methodology is described, but the specific user logs are internal. Prompts are provided in Appendix D.

📊 Experiments & Results

Evaluation Setup

Product research report generation based on complex user queries

Benchmarks:

ProductResearch Dataset (Long-form report generation) [New]

Metrics:

RACE (Report Agent Comparison Evaluation)
Effective Product Count (E.Prod)
Dimension scores: Comprehensiveness, Depth, Instruction-Following, Readability
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison showing the fine-tuned model outperforms baselines and approaches proprietary systems.
ProductResearch Dataset	RACE Score	31.78	45.40	+13.62
ProductResearch Dataset	RACE Score	29.84	45.40	+15.56
ProductResearch Dataset	Effective Product Count (E.Prod)	3.58	12.45	+8.87
Context length scaling experiments showing the necessity of long context for deep research.
ProductResearch Dataset	RACE Score	37.75	45.40	+7.65

Experiment Figures

Performance (RACE score) scaling with training context length (32k, 64k, 96k, 128k)

Evolution of report quality scores across iterative supervision rounds (Round 1 to 6)

Main Takeaways

The fine-tuned compact model (30B) approaches the performance of frontier proprietary systems (Gemini-DeepResearch) when trained on high-quality synthetic trajectories
Context length is critical; scaling from 32k to 128k tokens yields consistent improvements across all report quality dimensions
The Supervisor Agent's iterative feedback significantly improves intermediate report quality, with the largest gains seen in the first two rounds of supervision
Open-source Deep Research models trained on general web search struggle to generalize to e-commerce specific tools without domain-specific fine-tuning

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based Agents (ReAct framework)
Knowledge of Supervised Fine-Tuning (SFT)
Familiarity with Synthetic Data Generation

Key Terms

ReAct: Reason+Act—a paradigm where LLMs interleave reasoning traces with tool execution steps to solve tasks

Deep Research: An agentic paradigm focused on long-horizon, multi-step information synthesis and report generation, rather than simple Q&A

MoE: Mixture of Experts—a neural network architecture where different sub-models (experts) specialize in different parts of the input space

RACE: Report Agent Comparison Evaluation—a metric for scoring research reports against a reference based on criteria like comprehensiveness and depth

Reflective Internalization: A process of rewriting multi-agent feedback loops into a single agent's self-correction thought process for training data

Hallucination: When an LLM generates factually incorrect information or references non-existent products

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, specific dataset to adapt it to a task

Trajectory: The sequence of thoughts, actions (tool calls), and observations an agent takes to solve a problem