Bayesian Optimization with LLM-Based Acquisition Functions for Natural Language Preference Elicitation

📝 Paper Summary

Conversational Recommendation Systems, Preference Elicitation (PE), Bayesian Optimization

PEBOL formulates natural language preference elicitation as a Bayesian Optimization problem, using Natural Language Inference to update beliefs and Thompson Sampling to guide LLMs in generating strategic queries.

Core Problem

Monolithic LLMs lack the decision-theoretic reasoning to balance exploration and exploitation in dialogues, while traditional Bayesian methods cannot handle natural language inputs or generate fluent queries.

Why it matters:

Current LLM-based recommenders often over-exploit known preferences or waste turns on irrelevant items due to a lack of formal lookahead planning
Providing all item descriptions to an LLM's context window for reasoning is computationally prohibitive for large catalogs
Traditional preference elicitation requires explicit ratings or comparisons, which are unnatural for users unfamiliar with the item set

Concrete Example: In a cold-start movie recommendation, a standard LLM might randomly ask 'Do you like comedy?' (inefficient). A traditional Bayesian system might ask 'Rate movie ID #593' (unnatural). PEBOL's Bayesian policy selects the most informative item (e.g., *The Matrix*), prompts the LLM to ask 'What do you think of sci-fi action movies like The Matrix?', and uses the user's text response to mathematically update probabilities for all similar movies.

Key Novelty

PEBOL (Preference Elicitation with Bayesian Optimization augmented LLMs)

Replaces the 'black box' reasoning of LLMs with a formal Bayesian Optimization loop (Thompson Sampling/UCB) to select which item to discuss next
Uses the LLM only as a 'translator': it converts the mathematically selected item into a natural language query and doesn't need to see the full item catalog
Employs a Natural Language Inference (NLI) model as a likelihood function, converting unstructured text responses into numerical probability updates for item utilities

Architecture

The iterative PEBOL workflow combining Bayesian belief updates with LLM query generation

Evaluation Highlights

+0.096 MRR@10 improvement (0.270 vs 0.174) over GPT-3.5 on the MovieLens-25M dataset after 10 dialogue turns
Achieves ~3x higher MRR@10 on the Amazon Books dataset (0.134) compared to GPT-3.5 (0.046)
Far outperforms purely random item selection (0.270 vs 0.003 MRR@10 on MovieLens-25M), indicating that the Bayesian selection strategy itself contributes to the gains

Breakthrough Assessment

7/10

Novel bridging of rigorous Bayesian Optimization with the generative capabilities of LLMs for cold-start recommendation. It sidesteps the context-window scaling issue, though the evaluation relies entirely on simulated users.

⚙️ Technical Details

Problem Definition

Setting: Bayesian Optimization for finding an optimal item i* from a set I based on latent user utilities u

Inputs: Set of item descriptions x, natural language dialogue history H

Outputs: Next natural language query q, and eventually a ranked list of recommended items

Pipeline Flow

Observation Processing (NLI) → Belief Update (Bayes) → Item Selection (BO Strategy) → Query Generation (LLM)

System Modules

Belief Tracker

Maintains a Beta distribution (alpha, beta parameters) for every item in the catalog representing the probability the user likes it

Model or implementation: Beta Distribution Parameters
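
A minimal sketch of such a belief tracker follows. It assumes a uniform Beta(1, 1) prior and a fractional-count update in which a soft observation w in [0, 1] (for instance an NLI entailment score) adds w to alpha and 1 - w to beta. The class name `BetaBelief`, the prior, and the second item title are illustrative assumptions, not details confirmed by this summary.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Belief that the user likes one item, held as Beta(alpha, beta)."""

    alpha: float = 1.0  # pseudo-count of "likes" evidence (uniform prior assumed)
    beta: float = 1.0   # pseudo-count of "dislikes" evidence

    def update(self, w: float) -> None:
        """Fold in a soft observation w in [0, 1] as fractional counts (assumed rule)."""
        if not 0.0 <= w <= 1.0:
            raise ValueError("w must lie in [0, 1]")
        self.alpha += w
        self.beta += 1.0 - w

    @property
    def mean(self) -> float:
        """Posterior mean probability that the user likes the item."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Posterior variance; it shrinks as evidence accumulates."""
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1.0))


# One belief per catalog item; the second title is a hypothetical example.
beliefs = {"The Matrix": BetaBelief(), "Notting Hill": BetaBelief()}
beliefs["The Matrix"].update(0.9)    # response strongly supports this item
beliefs["Notting Hill"].update(0.1)  # response weakly supports this item
```

Keeping one pair of parameters per item makes each update a constant-time operation, which matches the factorized-utility assumption noted under Limitations.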

Policy Selector

Selects the next target item to query about based on current uncertainty and expected utility

Model or implementation: Thompson Sampling or UCB
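
The two selection rules can be sketched over per-item Beta parameters as below. This is an illustration under assumptions: the UCB score is taken to be the posterior mean plus c posterior standard deviations, and the item keys, parameters, and the constant c are hypothetical; the paper's exact acquisition formulas may differ.

```python
import math
import random


def thompson_select(params, rng):
    """Draw one sample per item from its Beta posterior and pick the largest draw."""
    draws = {item: rng.betavariate(a, b) for item, (a, b) in params.items()}
    return max(draws, key=draws.get)


def ucb_select(params, c=1.0):
    """Pick the item with the highest posterior mean plus c posterior std devs."""

    def score(item):
        a, b = params[item]
        n = a + b
        mean = a / n
        std = math.sqrt(a * b / (n * n * (n + 1.0)))
        return mean + c * std

    return max(params, key=score)


# Hypothetical catalog: A looks good, B is unexplored, C looks bad.
params = {"A": (8.0, 2.0), "B": (1.0, 1.0), "C": (2.0, 8.0)}
rng = random.Random(0)  # seeded only to make the example repeatable
picks = [thompson_select(params, rng) for _ in range(1000)]
```

With a small c, UCB exploits the well-supported item A; with a large c, the wide posterior of the unexplored item B wins. Thompson Sampling reaches a similar balance through randomness, choosing A most of the time while still occasionally querying B.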

Query Generator

Generates a natural language question focused on the selected target item

Model or implementation: GPT-3.5-Turbo / Gemini-Pro
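
To illustrate why the LLM never needs the full catalog, the sketch below assembles a prompt from the dialogue history and a single target item description. The function name `build_query_prompt` and the prompt wording are hypothetical; the actual prompts used in the paper are not given in this summary, and the call to the LLM API is deliberately left out.

```python
from typing import List, Tuple


def build_query_prompt(item_description: str, history: List[Tuple[str, str]]) -> str:
    """Build a prompt asking an LLM for one question about a single target item.

    The template text is an assumption for illustration only.
    """
    lines = [
        "You are helping a user find something they will enjoy.",
        "Conversation so far:",
    ]
    if history:
        for question, answer in history:
            lines.append(f"System: {question}")
            lines.append(f"User: {answer}")
    else:
        lines.append("(no turns yet)")
    # Only the item chosen by the Bayesian policy is shown to the LLM.
    lines.append(f"Target item description: {item_description}")
    lines.append("Ask one short, natural question that probes whether the user would like this item.")
    return "\n".join(lines)


prompt = build_query_prompt(
    "A sci-fi action movie about simulated reality",
    [("Do you like comedy?", "Not really")],
)
```

The prompt length depends on the dialogue history and one description, not on the catalog size.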

Response Interpreter

Calculates the likelihood that the user's response implies a preference for specific items

Model or implementation: bart-large-mnli
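
The scoring step can be sketched as follows, under two assumptions: the item description serves as the NLI premise and a statement derived from the user's response serves as the hypothesis, and the model returns three logits in the order contradiction, neutral, entailment (this order should be checked against the model configuration). The `toy_nli_logits` function is a stand-in based on word overlap, not a real NLI model; in practice a wrapper around bart-large-mnli would be passed in its place.

```python
import math
from typing import Callable, Dict, Sequence

# Assumed label order of the three NLI logits; verify against the model config.
CONTRADICTION, NEUTRAL, ENTAILMENT = 0, 1, 2


def entailment_probability(logits: Sequence[float]) -> float:
    """Softmax over the three NLI logits, returning the entailment probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[ENTAILMENT] / sum(exps)


def score_items(
    descriptions: Dict[str, str],
    hypothesis: str,
    nli_logits: Callable[[str, str], Sequence[float]],
) -> Dict[str, float]:
    """Score every item description (premise) against one hypothesis."""
    return {
        item: entailment_probability(nli_logits(text, hypothesis))
        for item, text in descriptions.items()
    }


def toy_nli_logits(premise: str, hypothesis: str) -> Sequence[float]:
    """Stand-in scorer (NOT a real NLI model): shared words raise the entailment logit."""
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return [0.0, 0.0, float(len(shared))]


# Hypothetical item descriptions and hypothesis text.
descriptions = {
    "The Matrix": "a sci-fi action movie about simulated reality",
    "Notting Hill": "a romantic comedy set in london",
}
scores = score_items(descriptions, "the user likes sci-fi action movies", toy_nli_logits)
```

Each resulting score lies in [0, 1] and can be treated as a soft observation about the corresponding item.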

Novel Architectural Elements

Decoupling the 'decider' (Bayesian Policy) from the 'speaker' (LLM), allowing the system to reason over very large item sets without filling the LLM context window
Using NLI scores directly as likelihood observations in a Bayesian update loop for preference learning

Modeling

Base Model: GPT-3.5-Turbo and Gemini-Pro (for query generation)

Comparison to Prior Work

vs. Monolithic LLMs: PEBOL uses formal lookahead (BO) rather than implicit LLM reasoning; PEBOL does not require full item history in context
vs. G20Q: PEBOL generates open-ended natural language queries rather than selecting from fixed attribute lists
vs. PICASSO [not cited in paper]: PEBOL uses NLI for generic item descriptions rather than requiring pre-extracted feature hierarchies

Limitations

Assumes items are independent (factorized utility), ignoring correlations between similar items
Evaluation relies on simulated 'UserBots' rather than real human studies
The NLI observation model is computationally expensive if run on the entire catalog every turn (though the paper uses a localized update approximation)

Reproducibility

Code: https://github.com/D3Mlab/llm-pe

Uses standard APIs (OpenAI, Google) and HuggingFace models (bart-large-mnli). Experiments rely on a UserBot simulation with assumed ground truth utilities.

📊 Experiments & Results

Evaluation Setup

Simulated dialogue with a UserBot that holds ground truth preferences and responds to system queries

Benchmarks:

MovieLens-25M (Movie Recommendation)
Amazon Books (Book Recommendation)

Metrics:

MRR@10 (Mean Reciprocal Rank)
SR@10 (Success Rate)
AT (Average Turns)

Statistical methodology: Averages over 50-100 dialogue simulations; standard deviations are shown in plots, but no significance tests are reported in the text
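
As a worked illustration of the MRR@10 metric listed above, the sketch below assumes each simulated dialogue ends with a ranked list and a single ground-truth target item, and that "@10" is a cutoff on the ranked list. The function names and the toy rankings are hypothetical; SR@10 and AT are omitted because their exact definitions are not stated here.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank_at_k(ranked: Sequence[str], target: str, k: int = 10) -> float:
    """Return 1/rank of the target within the top k, or 0.0 if it is absent."""
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    return 1.0 / (top.index(target) + 1)


def mrr_at_k(runs: List[Tuple[Sequence[str], str]], k: int = 10) -> float:
    """Average the reciprocal rank over (ranked_list, target) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(reciprocal_rank_at_k(ranked, target, k) for ranked, target in runs) / len(runs)


# Three toy dialogues: target ranked 1st, ranked 3rd, and missing from the list.
runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]
mrr = mrr_at_k(runs, k=10)  # (1 + 1/3 + 0) / 3
```

Under this reading, an MRR@10 of 0.270 would correspond to the target sitting, on average, at roughly rank 3 to 4 when it appears in the top 10.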

Key Results

Comparison against monolithic LLMs shows PEBOL's superior efficiency in identifying preferences within limited dialogue turns.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-25M	MRR@10	0.174 (GPT-3.5)	0.270	+0.096
Amazon Books	MRR@10	0.046 (GPT-3.5)	0.134	+0.088
MovieLens-25M	MRR@10	0.003 (random selection)	0.270	+0.267
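
The reported deltas and the "~3x" claim can be checked with a few lines of arithmetic. The numbers below are copied from the results above, and the baseline labels follow the Evaluation Highlights; nothing here is new data.

```python
# (benchmark, baseline name, baseline MRR@10, PEBOL MRR@10)
rows = [
    ("MovieLens-25M", "GPT-3.5", 0.174, 0.270),
    ("Amazon Books", "GPT-3.5", 0.046, 0.134),
    ("MovieLens-25M", "random selection", 0.003, 0.270),
]

summary = []
for benchmark, baseline_name, baseline, pebol in rows:
    delta = round(pebol - baseline, 3)  # absolute improvement
    ratio = round(pebol / baseline, 2)  # relative improvement
    summary.append((benchmark, baseline_name, delta, ratio))
    print(f"{benchmark} vs {baseline_name}: delta={delta:+.3f}, ratio={ratio:.2f}x")
```

The Amazon Books ratio comes out at about 2.9x, consistent with the rounded "~3x", while the MovieLens-25M gain over GPT-3.5 is about 1.55x.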

Experiment Figures

MRR@10 performance curves over 10 turns of dialogue for Movies and Books datasets

Main Takeaways

Formal Bayesian strategies (Thompson Sampling/UCB) substantially outperform monolithic LLM reasoning in cold-start preference elicitation in the reported simulations
PEBOL remains robust to user noise (simulated misunderstanding or vague answers) compared to baselines
The approach scales better than context-stuffing methods because the LLM only sees the single targeted item description per turn
Thompson Sampling (PEBOL-TS) generally outperforms UCB (PEBOL-UCB) in this conversational setting

📚 Prerequisite Knowledge

Prerequisites

Bayesian Optimization (priors, posteriors, acquisition functions)
Reinforcement Learning basics (Exploration vs. Exploitation)
Natural Language Inference (Entailment/Contradiction)

Key Terms

NLI: Natural Language Inference, the task of determining whether one text segment (the hypothesis) logically follows from another (the premise)

Thompson Sampling: A heuristic for choosing actions that addresses the exploration-exploitation dilemma by sampling from the posterior distribution of rewards

UCB: Upper Confidence Bound, an algorithm that selects actions with the highest potential value (mean + uncertainty margin) to encourage exploration

MRR: Mean Reciprocal Rank, the average over queries (here, simulated dialogues) of 1/rank of the first relevant item in the returned list, counted as 0 when no relevant item is returned

Acquisition Function: In Bayesian Optimization, a function that calculates the utility of evaluating a specific point next, balancing exploration and exploitation

Cold Start: The scenario where the system has no prior interaction history for a user and must learn preferences from scratch