SE-PQA: Personalized Community Question Answering

📝 Paper Summary

Community Question Answering (cQA), Personalized Information Retrieval

SE-PQA introduces a large-scale, real-world dataset for personalized community question answering, featuring over 1 million questions with rich user interaction metadata, and demonstrates that simple personalization models significantly improve retrieval effectiveness.

Core Problem

Existing datasets for personalized search are either synthetic, lack rich user-level features, or have severe privacy/ethical issues (e.g., AOL logs), hindering the development of deep learning models for personalization.

Why it matters:

Lack of high-quality, large-scale public data prevents robust evaluation of neural personalization models.
Privacy concerns and anonymization in existing query logs often strip away the user context necessary for training effective personalizers.
Synthetically enriched datasets (like PERSON or Amazon product search) rely on strong assumptions that may not reflect real-world user behavior.

Concrete Example: In AOLIA, user relevance is inferred from clicks on URLs without text content, while synthetic datasets generate queries from product hierarchies (e.g., 'photo digital camera lenses'). SE-PQA instead provides actual user questions and explicit 'best answer' selections, allowing models to distinguish which answer a specific user prefers among multiple correct ones.

Key Novelty

Large-Scale Real-World cQA Personalization Benchmark

Curates a massive dataset from 50 StackExchange communities with over 1 million questions, preserving rich social metadata (votes, tags, badges, user history).
Defines a 'Personalized TAG model' baseline that ranks answers higher if the answerer's history shares topical tags with the questioner's history, modeling interest alignment.

Architecture

Conceptual illustration of the StackExchange data structure used to build SE-PQA, highlighting relationships between Users, Questions, Answers, Votes, and Tags.

Evaluation Highlights

+0.018 MAP@100 (0.510 to 0.528, roughly 3.5% relative) on the personalized test set when adding the simple TAG personalization model to a T5-base re-ranker.
Personalization yields statistically significant gains for all tested neural models (DistilBERT, MiniLM, MonoT5) on the personalized dataset version.
Multi-domain personalization (training across 50 communities) is more effective than single-domain personalization, which yields no significant improvement in 38 out of 50 individual communities.

Breakthrough Assessment

7/10

Significant contribution as a resource paper filling a major gap in public personalized IR datasets. The modeling contribution (TAG) is simple but effective, serving primarily to validate the dataset's utility.

⚙️ Technical Details

Problem Definition

Setting: Ad-hoc retrieval task where the question is the query and the system retrieves relevant answers from a pool of historical answers.

Inputs: User question q asked by user u

Outputs: Ranked list of answers {a_1, ..., a_k}

Pipeline Flow

First Stage: BM25 Retrieval (recall-oriented)
Second Stage: Neural Re-ranking + Personalization Score

System Modules

First Stage Retriever

Select candidate documents efficiently

Model or implementation: BM25 (Elasticsearch)
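
As a rough illustration of what the first stage computes, here is a minimal sketch of the classic Okapi BM25 scoring formula in plain Python. It is not the paper's Elasticsearch setup: the toy corpus, the whitespace tokenization, and the helper names `bm25_scores` and `top_k` are assumptions made for this example, and `k1=1.2`, `b=0.75` are the usual defaults rather than values confirmed by the source.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized doc in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=100):
    """Return indices of the k highest-scoring docs (first-stage candidates)."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical toy corpus of answers, tokenized on whitespace.
corpus = [
    "use a prime lens for low light portraits".split(),
    "bread dough needs time to rise".split(),
    "a wide lens helps with landscape photos".split(),
]
candidates = top_k("which lens for portraits".split(), corpus, k=2)
```

In the setting described here, `k` would be 100 and the returned candidates would be passed to the neural re-ranker.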

Neural Re-ranker (Ranking)

Re-rank candidates based on semantic relevance

Model or implementation: MonoT5-base (with Adapters) or MiniLM / DistilBERT

TAG Personalizer (Ranking)

Compute personalization score based on tag overlap between users

Model or implementation: Jaccard-like similarity on tag sets
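
The summary only describes the TAG score as "Jaccard-like", so the following is a sketch under the assumption that it is a plain Jaccard similarity between the asker's and the answerer's historical tag sets. The function name `tag_score` and the example tags are hypothetical, and the paper's exact normalization may differ.

```python
def tag_score(asker_tags, answerer_tags):
    """Jaccard similarity between two users' tag sets (0.0 if both are empty)."""
    asker, answerer = set(asker_tags), set(answerer_tags)
    union = asker | answerer
    if not union:
        return 0.0
    return len(asker & answerer) / len(union)


# Hypothetical tag histories for the user asking and the user answering.
asker_history = {"python", "pandas", "numpy"}
answerer_history = {"python", "numpy", "scipy", "matplotlib"}
score = tag_score(asker_history, answerer_history)  # 2 shared tags out of 5 distinct
```

A new user with no tag history gets a score of 0.0 against every answerer, which matches the sparsity limitation noted for the TAG model.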

Novel Architectural Elements

Integration of folksonomy-based user profiles (tag history) directly into the re-ranking score via linear combination with neural relevance scores
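
One way to picture the linear combination is the sketch below. The weight `lam`, its value, the dictionary field names, and the assumption that both scores are already on comparable scales are illustrative choices, not details taken from the paper.

```python
def personalized_rerank(candidates, lam=0.1):
    """Sort candidates by (1 - lam) * neural_score + lam * tag_score, best first."""
    def combined(c):
        return (1 - lam) * c["neural_score"] + lam * c["tag_score"]
    return sorted(candidates, key=combined, reverse=True)


# Hypothetical candidates: neural relevance plus tag-overlap score per answer.
candidates = [
    {"answer_id": "a1", "neural_score": 0.80, "tag_score": 0.0},
    {"answer_id": "a2", "neural_score": 0.78, "tag_score": 0.6},
    {"answer_id": "a3", "neural_score": 0.40, "tag_score": 0.9},
]
ranking = [c["answer_id"] for c in personalized_rerank(candidates, lam=0.1)]
```

With a small `lam`, personalization acts as a tie-breaker among answers the neural model rates similarly (here `a2` overtakes `a1`), while a clearly weaker answer such as `a3` stays at the bottom despite high tag overlap.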

Modeling

Base Model: MonoT5-base (fine-tuned from T5-base)

Training Method: Supervised Fine-Tuning with Triplet Margin Loss (DistilBERT) or Sequence-to-Sequence Ranking (MonoT5)

Objective Functions:

Purpose: Optimize ranking by distinguishing positive from negative answers.

Formally: Triplet Margin Loss with margin γ=0.5 (for DistilBERT).

Purpose: Optimize adapter parameters efficiently.

Formally: MonoT5 generation probability for 'true' vs 'false' token.
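
The two objectives can be sketched in plain Python as follows. The margin of 0.5 comes from the text above; the Euclidean distance, the function names, and the example vectors and logits are assumptions for illustration, since the summary does not state which distance or similarity the paper uses.

```python
import math


def triplet_margin_loss(query, positive, negative, margin=0.5):
    """max(0, d(q, pos) - d(q, neg) + margin) with Euclidean distance d."""
    d_pos = math.dist(query, positive)
    d_neg = math.dist(query, negative)
    return max(0.0, d_pos - d_neg + margin)


def monot5_relevance(true_logit, false_logit):
    """Softmax over the 'true'/'false' token logits; returns P('true')."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


# Positive is much closer than the negative, so the margin is already satisfied.
easy = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative is closer than the positive, so the loss is positive.
hard = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
# Equal logits mean the model is undecided between 'true' and 'false'.
p_relevant = monot5_relevance(2.0, 2.0)
```

At inference time the MonoT5-style probability serves as the relevance score used to sort candidate answers.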

Adaptation: Adapter modules (intermediate dimension 48)

Trainable Parameters: Adapters only for MonoT5-base; Full fine-tuning for DistilBERT/MonoT5-small
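
To give a sense of how small the adapters are, the sketch below assumes the common bottleneck design (down-projection, ReLU, up-projection, residual connection). Only the intermediate dimension of 48 comes from the summary; the layout, the activation, and the function names are assumptions, and the paper's adapter variant may differ.

```python
def adapter_param_count(hidden_size, bottleneck=48):
    """Weights and biases of one down-projection plus one up-projection."""
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up


def adapter_forward(x, w_down, b_down, w_up, b_up):
    """Residual bottleneck: x + W_up * relu(W_down * x + b_down) + b_up."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_down, b_down)
    ]
    update = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_up, b_up)
    ]
    return [xi + ui for xi, ui in zip(x, update)]


# T5-base uses a hidden size of 768.
params_per_adapter = adapter_param_count(768)

# With a zero-initialised up-projection the adapter starts as the identity map.
x = [1.0, -2.0, 0.5]
w_down = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 3 -> 2 projection
w_up = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # toy 2 -> 3 projection
out = adapter_forward(x, w_down, [0.0, 0.0], w_up, [0.0, 0.0, 0.0])
```

Under these assumptions each adapter adds 74,544 parameters, which is tiny next to the frozen backbone.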

Training Data:

Train: 2008-2019 data (822k questions)
Val: 2020 data (78k questions)
Test: 2021-2022 data (99k questions)
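
The temporal split above can be expressed as a small helper. This sketch assumes the boundaries fall on calendar years, as the summary suggests; the exact cutoff dates are given in the paper, and the record layout and the name `split_for` are hypothetical.

```python
from datetime import date


def split_for(created: date) -> str:
    """Map a question's creation date to 'train', 'val', or 'test' by year."""
    if created.year <= 2019:
        return "train"
    if created.year == 2020:
        return "val"
    return "test"


# Hypothetical question records with creation dates.
questions = [
    {"id": "q1", "created": date(2015, 6, 1)},
    {"id": "q2", "created": date(2020, 3, 14)},
    {"id": "q3", "created": date(2022, 1, 9)},
]
assignment = {q["id"]: split_for(q["created"]) for q in questions}
```

Splitting by time rather than at random keeps future questions out of the training set, so a user's history at test time only contains earlier activity.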

Key Hyperparameters:

learning_rate: 1e-6 (DistilBERT), 1e-3 (MonoT5-small)
batch_size: 16 (DistilBERT), 128 (MonoT5-small), 64 (MonoT5-base)
epochs: 10
triplet_margin: 0.5

Compute: Not reported in the paper

Comparison to Prior Work

vs. AOLIA: Provides full text content and explicit user relevance (best answer selection) vs. URL-only logs
vs. PERSON/Amazon: Real-world questions vs. synthetic queries
vs. Yandex: Publicly available for training vs. anonymized/restricted

Limitations

TAG model relies on explicit user tagging history, which may be sparse for new users
Personalization improvements are inconsistent across single-domain communities (only 12/50 showed significant gains)
Evaluation is limited to re-ranking top-100 BM25 results, potentially missing relevant documents with low lexical overlap

Reproducibility

Code: https://github.com/pkasela/SE-PQA

Dataset available on Zenodo. Code available on GitHub. Baseline models (MiniLM) used off-the-shelf; others fine-tuned. Detailed split dates provided.

📊 Experiments & Results

Evaluation Setup

Re-ranking top-100 BM25 results. Two dataset versions: 'Base' (all answers with positive score are relevant) and 'Pers' (only user-selected 'best answer' is relevant).
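
The difference between the two relevance definitions can be sketched as follows. The field names, the function `build_qrels`, and the handling of questions without a selected best answer are assumptions for illustration only.

```python
def build_qrels(answers, asker_accepted_id=None):
    """Return (base, pers) sets of relevant answer ids for one question."""
    # Base: every answer with a positive community score counts as relevant.
    base = {a["id"] for a in answers if a["score"] > 0}
    # Pers: only the answer the asker selected as best counts as relevant.
    pers = {asker_accepted_id} if asker_accepted_id is not None else set()
    return base, pers


# Hypothetical answers to a single question, with community vote scores.
answers = [
    {"id": "a1", "score": 12},
    {"id": "a2", "score": 3},
    {"id": "a3", "score": -1},
]
base, pers = build_qrels(answers, asker_accepted_id="a2")
```

Here the asker preferred `a2` even though `a1` has more votes, which is the kind of user-specific preference the Pers version is designed to capture.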

Benchmarks:

SE-PQA Base (Ad-hoc Retrieval) [New]
SE-PQA Pers (Personalized Ad-hoc Retrieval) [New]

Metrics:

MAP@100
NDCG@10
P@1
Recall@100

Statistical methodology: Bonferroni-corrected two-sided paired Student's t-test at the 99% confidence level
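
For reference, the ranking metrics listed above can be computed per query as in the sketch below, using binary relevance. The function names and the toy ranking are hypothetical, and normalizing average precision by min(number of relevant answers, k) is one common convention that the paper's evaluation tooling may or may not share. MAP@100 is the mean of the per-query average precision values.

```python
import math


def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked item is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def recall_at_k(ranked, relevant, k=100):
    """Fraction of relevant items found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k=100):
    """Mean of precision values at each rank (<= k) where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking for one query with two relevant answers.
ranked = ["a3", "a1", "a2", "a4"]
relevant = {"a1", "a2"}
ap = average_precision_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```

In the Pers setting each query has a single relevant answer, so average precision reduces to the reciprocal rank of the asker's chosen answer.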

Key Results

Results on the personalized (Pers) dataset version, where relevance is strictly the user-selected best answer, show significant gains from the TAG model.

Benchmark	Metric	Baseline	This Paper	Δ
SE-PQA Pers	MAP@100	0.510	0.528	+0.018
SE-PQA Pers	P@1	0.417	0.440	+0.023
SE-PQA Pers	NDCG@10	0.525	0.543	+0.018

Results on the Base dataset version (relevance = any positive score) show smaller but still significant gains.

Benchmark	Metric	Baseline	This Paper	Δ
SE-PQA Base	MAP@100	0.443	0.457	+0.014

Main Takeaways

Personalization (TAG model) consistently improves effectiveness across all underlying neural models (DistilBERT, MiniLM, T5) on the Pers dataset.
Deep learning models (T5, MiniLM) significantly outperform traditional BM25, with T5-base achieving the highest absolute performance.
Multi-domain data is crucial for this personalization approach; improvements vanish or are insignificant in 38 out of 50 single-community subsets, suggesting user history from diverse domains aids in modeling interests.

📚 Prerequisite Knowledge

Prerequisites

Information Retrieval basics (BM25, re-ranking)
Neural ranking models (Cross-Encoders, Bi-Encoders)
Community Question Answering (cQA) structure

Key Terms

cQA: Community Question Answering—platforms like StackExchange where users ask questions and others answer, with community voting.

Folksonomy: A system of classification derived from the practice and method of collaboratively creating and managing tags to annotate and categorize content.

BM25: A probabilistic retrieval function, used as a standard ranking baseline, that scores documents by term frequency (with saturation), inverse document frequency, and document-length normalization.

Hard negatives: Documents that are not relevant but are similar enough to the query to be difficult for a model to distinguish from relevant ones.

MonoT5: A T5 (Text-to-Text Transfer Transformer) model fine-tuned for document ranking by generating 'true' or 'false' tokens.

Adapter modules: Lightweight, trainable layers inserted into pre-trained models to adapt them to new tasks without fine-tuning the entire network.

MAP: Mean Average Precision—a metric that calculates the average precision for each query and then averages across all queries.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items.