
Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth

Faye Zhang, Qianyu Cheng, Jasmine Wan, Vishwakarma Singh, Jinfeng Rao, Kofi Boakye
Pinterest, Stanford University
arXiv (2026)
MM Agent Recommendation RAG

📝 Paper Summary

Generative Engine Optimization (GEO), Vision-Language Model (VLM) Applications, Multi-Agent Systems
Pinterest GEO fine-tunes VLMs to predict user search intent from images and employs AI agents to mine real-time trends, creating optimized landing pages that drive organic traffic growth.
Core Problem
Visual content lacks the textual depth and authority signals required by new generative search engines (like ChatGPT and Google AI Overviews), causing visual platforms to lose traffic as answers are synthesized directly on search result pages.
Why it matters:
  • Traditional SEO relies on keyword matching, but generative engines prioritize synthesized answers and citation authority, leaving image-heavy platforms invisible.
  • Visual assets lack inherent lexical metadata to signal relevance to complex user questions (e.g., 'garden party outfit ideas' vs. just 'pink dress').
  • Search demand is non-stationary; by the time trends appear in internal logs, the opportunity to capture traffic for emerging topics is often lost.
Concrete Example: A user searches 'modern office outfits 2026'. A standard captioning model labels an image 'woman in grey blazer,' which generic search engines ignore. The proposed system generates the intent-aligned query 'Modern Office Outfits for Women' and links it to a curated collection, causing the generative engine to cite the collection as an authoritative source.
Key Novelty
Reverse Search Design for Visual GEO
  • Shifts from 'image captioning' (describing what is visible) to 'intent prediction' (predicting what users would type to find the image).
  • Uses autonomous agents to mine external trends (Google Trends) and generate content *before* those signals appear in internal user logs.
  • Constructs 'link equity' programmatically by using VLM-generated queries to build dense internal linking structures between assets and collection pages.
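The third point can be illustrated with a minimal sketch, assuming each asset already has a list of VLM-predicted queries. It groups assets that share a query into a collection page and emits links in both directions between each asset and its page. The names `build_link_graph`, `normalize`, `min_assets`, the `collection:` prefix, and the `pin_*` ids are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
from collections import defaultdict


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries share one page."""
    return " ".join(query.lower().split())


def build_link_graph(predicted_queries, min_assets=2):
    """Group assets by predicted query and link each asset to its collection page.

    predicted_queries: dict mapping an asset id to a list of query strings
        (assumed to come from an intent-prediction model).
    min_assets: hypothetical threshold; queries matched by fewer assets
        do not get a collection page.

    Returns (collections, links), where collections maps a normalized query
    to a sorted list of asset ids and links is a sorted list of
    (source, target) pairs covering both directions.
    """
    grouped = defaultdict(set)
    for asset_id, queries in predicted_queries.items():
        for query in queries:
            grouped[normalize(query)].add(asset_id)

    # Keep only queries with enough supporting assets to justify a page.
    collections = {
        query: sorted(ids)
        for query, ids in grouped.items()
        if len(ids) >= min_assets
    }

    links = set()
    for query, ids in collections.items():
        page = f"collection:{query}"
        for asset_id in ids:
            links.add((asset_id, page))  # asset links up to its collection
            links.add((page, asset_id))  # collection links down to the asset
    return collections, sorted(links)


if __name__ == "__main__":
    predicted = {
        "pin_1": ["Modern Office Outfits for Women", "grey blazer outfit"],
        "pin_2": ["modern office outfits for women"],
        "pin_3": ["garden party outfit ideas"],
    }
    collections, links = build_link_graph(predicted)
    print(collections)
    print(links)
```

In this toy input only the shared office-outfit query reaches the threshold, so a single collection page is created and linked to both of its assets. A production system would presumably merge semantically similar queries rather than rely on exact string matches after normalization.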
Architecture
Figure 2: The end-to-end framework, illustrating how images are processed into queries and how agents mine trends to create collections.
Evaluation Highlights
  • 20% organic traffic growth achieved in production deployment across billions of images.
  • 94x lower inference cost compared to using commercial VLM APIs for the same task.
  • 19% improvement in topic-query alignment over production baselines using search performance signal fine-tuning.
Breakthrough Assessment
9/10
Highly significant industrial application. It formalizes 'Visual GEO' as a new problem class and deploys a solution at massive scale (billions of assets) that combines VLMs, agents, and graph theory to counter the threat generative search poses to visual platforms.