GeoGR: A Generative Retrieval Framework for Spatio-Temporal Aware POI Recommendation

📝 Paper Summary

Next Point-of-Interest (POI) Recommendation · Generative Retrieval · Spatio-Temporal Modeling

GeoGR adapts Large Language Models for next-POI prediction by encoding locations into hierarchical semantic IDs that explicitly capture spatio-temporal collaborative patterns, then training the model via continued pre-training and supervised fine-tuning.

Core Problem

Existing LLM-based POI recommenders rely on non-semantic identifiers or purely textual embeddings that fail to capture collaborative cross-category relationships (e.g., airport→hotel→parking) and struggle with the sparsity of real-world navigation data.

Why it matters:

Accurate prediction is critical for large-scale navigation platforms serving billions of users with diverse needs (dining, tourism, fueling).
Traditional sequential models miss the semantic reasoning of LLMs, while standard LLM approaches miss the structured spatio-temporal dependencies inherent in mobility data.

Concrete Example: A user searches for 'dinner' near a specific location. A standard LLM might recommend a generic popular restaurant based on text similarity. GeoGR, understanding the user's specific trajectory (e.g., arriving from an airport), recommends a hotel restaurant with parking, leveraging learned collaborative signals between these distinct categories.

Key Novelty

Geo-Aware Generative Recommendation Framework

Constructs 'Semantic IDs' (SIDs) for POIs not just from text, but by explicitly modeling geographically constrained co-visitation patterns using contrastive learning.
Aligns the LLM with these new SIDs through a two-stage process: Continued Pre-Training (CPT) on template-based tasks to learn the 'language' of SIDs, followed by Supervised Fine-Tuning (SFT) for the specific next-POI prediction task.

Architecture

The overall framework of GeoGR, split into two main stages: (1) Geo-aware SID Construction and (2) Generative POI Recommendation Training.

Evaluation Highlights

Online A/B testing on the AMAP platform (millions of users) showed significant gains on multiple online metrics.
Offline experiments on real-world datasets are reported to outperform state-of-the-art baselines; the specific numbers are not available in the summarized excerpt.

Breakthrough Assessment

8/10

Strong industrial application with a novel approach to 'grounding' LLMs in spatio-temporal data via specialized tokenization. Successfully deployed on a massive scale (AMAP).

⚙️ Technical Details

Problem Definition

Setting: Next POI recommendation formulated as a conditional probability maximization problem.

Inputs: User interaction history T_u (sequence of POIs, times, conditions) and current context con_u (time, location, query).

Outputs: The next POI p_{n+1} (represented as a sequence of Semantic ID tokens).
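
One way to write this objective, assuming the next POI is represented by an L-token SID (s_1, ..., s_L). The symbols s_l, L, and \theta are notation introduced here for illustration, not taken from the paper; L = 3 would match the reported number of codebook layers.

```latex
P_\theta(p_{n+1} \mid T_u, con_u)
  = \prod_{l=1}^{L} P_\theta\left(s_l \mid s_{<l},\, T_u,\, con_u\right),
\qquad
\hat{p}_{n+1} = \arg\max_{p}\; P_\theta(p \mid T_u, con_u)
```

Under this factorization, maximizing the conditional probability of the next POI reduces to ordinary autoregressive token prediction over the SID vocabulary.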

Pipeline Flow

Group 1: SID Construction: POI Representation Learning → Tokenization → Refinement
Group 2: Generative Training: Continued Pre-Training (CPT) → Supervised Fine-Tuning (SFT) → Inference

System Modules

POI Encoder (SID Construction)

Generate dense embeddings for POIs incorporating text and spatial context.

Model or implementation: Qwen 4B embedding (fine-tuned)

Tokenizer (RQ-Kmeans) (SID Construction)

Convert dense POI embeddings into discrete hierarchical Semantic IDs.

Model or implementation: Residual-quantization K-means (hierarchical clustering of residuals)
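
The tokenization step can be sketched as follows. This is a minimal, pure-Python illustration of residual quantization with K-means, not the paper's implementation: the function names (`kmeans`, `nearest`, `rq_kmeans`), the codebook size `k`, the iteration count, and the seeding scheme are all assumptions made for the example. Only the three-layer depth comes from the summary.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns the centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def rq_kmeans(embeddings, k, layers=3, seed=0):
    """Residual quantization: cluster, subtract the chosen centroid, repeat."""
    residuals = [list(e) for e in embeddings]
    codes = [[] for _ in embeddings]
    codebooks = []
    for layer in range(layers):
        centroids = kmeans(residuals, k, seed=seed + layer)
        codebooks.append(centroids)
        for i, r in enumerate(residuals):
            c = nearest(r, centroids)
            codes[i].append(c)  # one code per layer forms the hierarchical SID
            residuals[i] = [a - b for a, b in zip(r, centroids[c])]
    return codes, codebooks

# Toy usage: four 2-D "POI embeddings" forming two obvious groups.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codes, codebooks = rq_kmeans(embeddings, k=2, layers=3)
```

Each POI ends up with one code per layer, so POIs that share a coarse first-layer code are close in embedding space, while later layers encode progressively finer residual differences.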

SID Refiner (SID Construction)

Iteratively improve SIDs to be more predictable by the LLM.

Model or implementation: EM-style optimization algorithm

Generative Recommender

Predict the next POI's SID sequence given user context.

Model or implementation: Qwen 4B (CPT + SFT)

Novel Architectural Elements

Geo-aware SID tokenization pipeline that injects spatio-temporal collaborative signals directly into the ID creation process via contrastive learning on co-visit pairs.
EM-style iterative refinement loop where the LLM and the SID codebook mutually update to maximize learnability.
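
To make the idea of geographically constrained co-visit pairs concrete, here is a small sketch of how such pairs might be mined from a single trajectory. The function names, the one-hour time window, and the 5 km distance threshold are illustrative assumptions; the summary does not state the actual constraints used.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def co_visit_pairs(trajectory, coords, max_gap_s=3600, max_dist_km=5.0):
    """Collect ordered POI pairs visited close together in both time and space.

    trajectory: list of (poi_id, unix_timestamp), sorted by time.
    coords: dict mapping poi_id -> (lat, lon).
    Both thresholds are hypothetical values chosen for the example.
    """
    pairs = []
    for i, (poi_a, t_a) in enumerate(trajectory):
        for poi_b, t_b in trajectory[i + 1:]:
            if t_b - t_a > max_gap_s:
                break  # later visits are even further away in time
            if poi_a == poi_b:
                continue
            if haversine_km(*coords[poi_a], *coords[poi_b]) <= max_dist_km:
                pairs.append((poi_a, poi_b))
    return pairs

# Toy usage with made-up coordinates and timestamps.
coords = {
    "airport": (39.90, 116.40),
    "hotel": (39.91, 116.41),
    "parking": (39.912, 116.412),
    "resort": (40.90, 116.40),  # roughly 110 km north, so filtered out
}
trajectory = [("airport", 0), ("resort", 600), ("hotel", 1800),
              ("parking", 2400), ("airport", 100000)]
pairs = co_visit_pairs(trajectory, coords)
```

Pairs mined this way would serve as positives for the contrastive objective, which is how cross-category links such as airport→hotel can enter the embedding even when the POI texts are dissimilar.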

Modeling

Base Model: Qwen 4B

Training Method: Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Learn collaborative POI representations.

Formally: NCE loss L_{contrast} = -\log \frac{\exp(sim(e_i, e_j)/\tau)}{\sum_{k} \exp(sim(e_i, e_k)/\tau)}

Purpose: Align the LLM with SID tokens (CPT) and learn next-POI prediction (SFT).

Formally: Negative Log-Likelihood (NLL) loss over the autoregressive generation of SID tokens.
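
The contrastive objective above can be evaluated for a single anchor as in the sketch below. Several choices here are assumptions, since the summary does not pin them down: `sim` is taken to be cosine similarity, the denominator sums over the positive plus the sampled negatives (the common InfoNCE convention), and `tau=0.07` is a placeholder because the paper's temperature is not reported.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor e_i, one positive e_j, and negatives e_k.

    tau is a placeholder value; the paper's temperature is not reported.
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Toy usage: the loss is small when the co-visited POI is the closest one.
airport, hotel, cafe = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
loss_aligned = info_nce(airport, hotel, [cafe])
loss_misaligned = info_nce(airport, cafe, [hotel])
```

Minimizing this loss pulls co-visited POIs together and pushes other POIs apart, which is the property the later quantization step relies on.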

Training Data:

CPT Data: Template-based prompts (Text-to-SID, attributes-to-SID, trajectory-to-SID).
SFT Data: Instruction-tuning dataset using user trajectories (short-term history L_s=32) and context.
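
As an illustration of what such training examples might look like, the sketch below builds one Text-to-SID example for CPT and one next-POI example for SFT. The token format (`<a_12><b_5><c_200>`), the prompt wording, and the helper names are all hypothetical; the paper's actual templates are not given in the summary. Only the history cap of 32 is taken from it.

```python
def sid_to_tokens(sid):
    """Render a 3-level SID such as (12, 5, 200) as special tokens.

    The <level_code> format is a hypothetical choice for this example.
    """
    levels = "abc"
    return "".join(f"<{levels[i]}_{code}>" for i, code in enumerate(sid))

def cpt_text_to_sid(name, category, sid):
    """Text-to-SID example: the model learns to map POI text to its SID."""
    return {
        "prompt": f"POI name: {name}; category: {category}. Semantic ID: ",
        "completion": sid_to_tokens(sid),
    }

def sft_next_poi(history_sids, context, target_sid, max_history=32):
    """Next-POI example built from the most recent visits plus context."""
    recent = history_sids[-max_history:]  # short-term history, L_s = 32
    visited = " ".join(sid_to_tokens(s) for s in recent)
    prompt = (
        f"Visited POIs: {visited}\n"
        f"Time: {context['time']}; location: {context['location']}; "
        f"query: {context['query']}\n"
        "Next POI: "
    )
    return {"prompt": prompt, "completion": sid_to_tokens(target_sid)}

# Toy usage with made-up SIDs and context values.
cpt_example = cpt_text_to_sid("Terminal 3 Parking", "parking", (12, 5, 200))
history = [(i % 7, i % 5, i) for i in range(40)]
context = {"time": "19:30", "location": "39.90,116.40", "query": "dinner"}
sft_example = sft_next_poi(history, context, (3, 1, 42))
```

In both cases the completion consists only of SID tokens, so the NLL loss is computed over the same small vocabulary in CPT and SFT.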

Key Hyperparameters:

codebook_layers: 3
trajectory_length: 32
contrastive_temperature_tau: Not explicitly reported in the paper
beam_search_width: 20 (during SID optimization)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TIGER/GNPR-SID: GeoGR explicitly incorporates spatio-temporal collaborative signals (geo-constrained co-visits) into the SID construction via contrastive learning, rather than relying solely on text or pure quantization.
vs. LLM4POI: GeoGR uses discrete Semantic IDs instead of full text generation, improving efficiency and handling the specific vocabulary of locations better.
vs. Standard GR: Uses an EM-style iterative refinement to align the SIDs with the LLM's capability, rather than keeping SIDs fixed after quantization.

Limitations

Relies on proprietary data and platform (AMAP), limiting reproducibility.
Requires complex multi-stage training (Contrastive learning → Quantization → Refinement → CPT → SFT).
Specifics of the offline experimental results (tables/numbers) are not available in the summarized excerpt.

Reproducibility

No replication artifacts mentioned in the paper. The system is deployed on a proprietary platform (AMAP), and the dataset appears to be internal/proprietary real-world data.

📊 Experiments & Results

Evaluation Setup

Next-POI prediction on real-world datasets and online A/B testing on AMAP.

Benchmarks:

Internal/Real-world datasets (Next POI Prediction) [New]

Metrics:

Online engagement metrics (not specified; presumably CTR or conversion)
Offline accuracy metrics (likely Hit@K and NDCG@K; not explicitly listed in the summarized excerpt)
Statistical methodology: Not explicitly reported in the paper
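
If the offline metrics are indeed Hit@K and NDCG@K, which is an assumption here rather than something the summary confirms, they would be computed per test case roughly as follows for a single ground-truth next POI. The function names are illustrative.

```python
import math

def hit_at_k(ranked, target, k):
    """1.0 if the true next POI appears in the top-k predictions, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0 if missed."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position of the true POI
    return 1.0 / math.log2(rank + 1)

# Toy usage: the true next POI is ranked second.
ranked = ["poi_a", "poi_b", "poi_c"]
hit1 = hit_at_k(ranked, "poi_b", 1)
hit2 = hit_at_k(ranked, "poi_b", 2)
ndcg3 = ndcg_at_k(ranked, "poi_b", 3)
```

Averaging these per-case values over the test set gives the reported metric; with a generative recommender, `ranked` would come from the beam-search candidates mapped back to POIs.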

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
AMAP Platform	Online metrics	Not reported in the paper	Not reported in the paper	Positive (magnitude not reported)

Main Takeaways

GeoGR successfully integrates LLMs into a high-throughput industrial navigation platform.
The geo-aware SID tokenization effectively captures cross-category associations (e.g., airport→hotel) that standard text embeddings miss.
The two-stage alignment (CPT + SFT) is crucial for adapting the LLM to non-native SID tokens.

📚 Prerequisite Knowledge

Prerequisites

Generative Retrieval (GR) paradigms
Vector Quantization (RQ-VAE / RQ-Kmeans)
Large Language Model fine-tuning (CPT, SFT)
Contrastive Learning (NCE loss)

Key Terms

SID: Semantic ID—a short sequence of discrete tokens representing an item (POI) in a generative retrieval system, often derived from hierarchical clustering.

CPT: Continued Pre-Training—an intermediate training stage to adapt a pre-trained LLM to a new domain or vocabulary before specific task fine-tuning.

SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs (instruction tuning) to perform the specific downstream task.

RQ-Kmeans: Residual Quantization K-means—a method to discretize continuous vectors into hierarchical discrete codes by iteratively clustering residuals.

Co-visit pairs: Pairs of POIs that appear together in user trajectories within a short time window, indicating a behavioral relationship.

Spatio-temporal collaborative signals: Information derived from the collective movement patterns of users over time and space, revealing relationships between locations beyond just semantic similarity.