Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

📝 Paper Summary

Financial Recommendation, Conversational AI, User Modeling

Conv-FinRe is a benchmark that evaluates financial LLM advisors not just by how well they mimic user choices, but by how well they align with the user's latent financial utility and risk tolerance over time.

Core Problem

Existing financial recommendation benchmarks rely on behavioral imitation (mimicking user clicks/trades), but in finance, user actions are often noisy, emotional, or short-sighted, making them a poor proxy for true decision quality.

Why it matters:

A model that faithfully mimics noisy actions may reinforce bad financial habits rather than serve the user's long-term goals.
Current benchmarks cannot distinguish whether an LLM is reasoning rationally, blindly chasing market momentum, or overfitting to user idiosyncrasies.
Financial advisors must balance adhering to user instructions with providing normative guidance based on risk tolerance, a nuance missing from simple relevance-based evaluations.

Concrete Example: A user might panic and sell a solid stock during short-term volatility. A model trained only on behavioral imitation would recommend selling (matching the error), whereas a utility-grounded model should recognize the user's long-term risk profile and recommend holding or buying.

Key Novelty

Multi-View Utility-Grounded Evaluation

Evaluates recommendations against four distinct reference rankings: User Choice (empirical), Rational Utility (theoretical optimum), Market Momentum (trend-chasing), and Risk Sensitivity (safety-focused).
Uses Inverse Optimization to infer latent user risk parameters (sensitivity to volatility and drawdown) from longitudinal behavior, creating a 'ground truth' utility function that is hidden from the model.
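The three computed reference views can be sketched as rankings over per-stock statistics. This is only an illustrative sketch: the utility form (expected return minus penalties on variance and drawdown), the use of recent return as the momentum signal, the use of lowest variance as the risk-sensitive ordering, and all function and argument names are assumptions, not the paper's definitions. The User Choice view is the observed pick itself, so it is not computed here.

```python
def rank_desc(scores):
    """Return item indices ordered from highest to lowest score (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def reference_rankings(exp_return, variance, drawdown, recent_return, lam, gamma):
    """Build three hypothetical reference rankings for one decision step.

    lam and gamma are the user's (assumed) sensitivities to variance and drawdown.
    """
    # Assumed utility form: reward expected return, penalize variance and drawdown.
    utility = [r - lam * v - gamma * d
               for r, v, d in zip(exp_return, variance, drawdown)]
    return {
        "rational_utility": rank_desc(utility),
        "market_momentum": rank_desc(recent_return),                # trend-chasing
        "risk_sensitivity": rank_desc([-v for v in variance]),      # safest first
    }


views = reference_rankings(
    exp_return=[0.02, 0.05, 0.01],
    variance=[0.01, 0.09, 0.0025],
    drawdown=[0.05, 0.20, 0.02],
    recent_return=[0.01, 0.08, -0.01],
    lam=1.0,
    gamma=0.5,
)
print(views["rational_utility"])  # [2, 0, 1]
print(views["market_momentum"])   # [1, 0, 2]
```

In this toy case the momentum view and the utility view disagree on the top stock, which is the kind of divergence the multi-view evaluation is meant to expose.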

Architecture

The Conv-FinRe pipeline: Data Collection (User Profiling, Asset Simulation), Conversation Simulation, and Multi-View Evaluation.

Evaluation Highlights

Reveals a tension between alignment and utility: Models like GPT-4o often achieve higher utility-based ranking scores (uNDCG) but lower behavioral alignment (MRR) than domain-specialized models.
Specialized financial models (e.g., Llama3-XuanYuan3-70B) tend to overfit noisy user actions, mistaking transient emotional decisions for stable preferences.
General-purpose models often conflate long-term risk management with short-term market momentum, performing well on momentum baselines but failing to capture specific risk sensitivities.

Breakthrough Assessment

8/10

Significant methodological shift from 'behavior-as-truth' to 'utility-as-truth' in recommender systems. The use of inverse optimization to construct latent ground truth is a novel approach to evaluating rationality vs. imitation.

⚙️ Technical Details

Problem Definition

Setting: Multi-view longitudinal stock recommendation, i.e., iterative interaction between an advisor and a user over a fixed horizon T.

Inputs: User onboarding interview P, current market state M_t, and longitudinal interaction history H_{1:t-1} containing past dialogues and decisions.

Outputs: A ranked list of stocks pi_{i,t} from the candidate set S_t.

Pipeline Flow

User Profiling (Questionnaire & Onboarding)
Asset Simulation (Longitudinal Data Collection)
Preference Inference (Inverse Optimization)
Conversation Simulation (Dialogue Generation)
Multi-View Evaluation

System Modules

User Profiler (Data Generation)

Captures static user demographics, financial goals, and risk attitudes via structured questionnaires

Model or implementation: Rules/Scripts based on MiFID II/FINRA guidelines

Asset Simulator (Data Generation)

Collects longitudinal decision trajectories where users interact with stocks over 30 days

Model or implementation: Custom simulation tool (LetYourProfitsRun)

Preference Inferrer

Estimates user-specific risk parameters (lambda, gamma) from behavior using Inverse Optimization

Model or implementation: Regularized Negative Log-Likelihood Minimization
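A minimal sketch of how such an estimate could be obtained is given here, under stated assumptions. It assumes a multinomial-logit choice model, a utility of the form return minus lambda times variance minus gamma times drawdown, an L2 regularizer with weight `rho`, and plain projected gradient descent as the solver. None of these specifics are confirmed by the summary, and every name below is hypothetical.

```python
import math


def choice_probs(assets, lam, gamma):
    """Softmax choice probabilities over (exp_return, variance, drawdown) triples."""
    utils = [r - lam * v - gamma * d for r, v, d in assets]
    m = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in utils]
    z = sum(exps)
    return [e / z for e in exps]


def fit_risk_params(observations, rho=0.01, lr=0.5, steps=2000):
    """Minimize the regularized negative log-likelihood of observed choices.

    observations: list of (assets, chosen_index) pairs, one per decision.
    Returns (lam, gamma), both constrained to be non-negative.
    """
    lam, gamma = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the L2 penalty rho * (lam^2 + gamma^2).
        g_lam, g_gam = 2 * rho * lam, 2 * rho * gamma
        for assets, chosen in observations:
            p = choice_probs(assets, lam, gamma)
            exp_var = sum(pi * a[1] for pi, a in zip(p, assets))
            exp_dd = sum(pi * a[2] for pi, a in zip(p, assets))
            # d(-log p_chosen)/d lam = var_chosen - E_p[var]; same pattern for gamma.
            g_lam += assets[chosen][1] - exp_var
            g_gam += assets[chosen][2] - exp_dd
        # Gradient step, then project back onto lam, gamma >= 0.
        lam = max(0.0, lam - lr * g_lam)
        gamma = max(0.0, gamma - lr * g_gam)
    return lam, gamma


# A toy user who always picks the lower-variance stock despite its lower return.
menu = [(0.05, 0.30, 0.10), (0.03, 0.05, 0.10)]
lam, gamma = fit_risk_params([(menu, 1)] * 5)
print(lam > 0.0)  # True: the fit attributes the choices to variance aversion
```

Because both stocks in the toy menu have the same drawdown, the choices carry no information about gamma and it stays near zero. The regularizer keeps lambda finite even though the user never picks the riskier stock.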

Dialogue Generator (Data Generation)

Converts profiles and traces into multi-turn advisory conversations

Model or implementation: LLM-based generator (implied; the paper does not detail the generator itself)

Novel Architectural Elements

Latent Preference Grounding via Inverse Optimization: Decoupling the evaluation 'ground truth' from the observed user behavior by inferring underlying utility parameters.

Modeling

Base Model: Llama-3.3-70B-Instruct

Compute: Not reported in the paper

Comparison to Prior Work

vs. FinMem/FinGPT: Conv-FinRe evaluates utility alignment via inverse optimization rather than just prediction accuracy or trading profit.
vs. PIE: Focuses on longitudinal decision quality and risk preferences rather than just explanation quality or single-turn relevance.
vs. General RecSys (e.g., MovieLens): Addresses non-stationary assets and objective utility (risk/return) rather than purely subjective taste.

Limitations

Small scale of human participants (10 users) limits the diversity of behavioral phenotypes.
Stock universe is restricted to 10 representative S&P 500 stocks, potentially simplifying market dynamics.
Reliance on Yahoo Finance data for the specific period (Aug-Sep 2025) may not generalize to different market regimes (e.g., crash or boom).
Assumes user behavior can be modeled by a specific utility function form (Mean-Variance-Drawdown), which may not capture all irrationalities.

Reproducibility

Code: https://github.com/The-FinAI/Conv-FinRe

Dataset publicly released on Hugging Face (TheFinAI/conv-finre). Codebase available on GitHub (The-FinAI/Conv-FinRe). Simulation tool available on Hugging Face Spaces. Raw market data from Yahoo Finance API. Specific prompts for the conversation generation are part of the codebase.

📊 Experiments & Results

Evaluation Setup

Conversational stock recommendation over a 30-day simulated horizon.

Benchmarks:

Conv-FinRe (Longitudinal Stock Recommendation) [New]

Metrics:

Utility-based NDCG (uNDCG)
Mean Reciprocal Rank (MRR)
Hit Rate @ K (HR@1, HR@3)
Expert Alignment Score (EAS)

Statistical methodology: Not explicitly reported in the paper
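The ranking metrics listed above can be illustrated with a short sketch. The MRR and Hit Rate definitions are the standard ones. The uNDCG variant shown here simply uses per-item utility values as NDCG gains and assumes they are non-negative; the paper's exact gain definition is not given in this summary, so that part is an assumption. Item names are placeholders.

```python
import math


def reciprocal_rank(ranking, chosen):
    """1 / (1-based position of the user's chosen item), or 0.0 if it is absent."""
    return 1.0 / (ranking.index(chosen) + 1) if chosen in ranking else 0.0


def hit_at_k(ranking, chosen, k):
    """1.0 if the chosen item appears in the top k, else 0.0."""
    return 1.0 if chosen in ranking[:k] else 0.0


def undcg(ranking, utility, k=None):
    """NDCG with gains read from a per-item utility dict (assumed non-negative)."""
    k = k or len(ranking)
    dcg = sum(utility[item] / math.log2(pos + 2)
              for pos, item in enumerate(ranking[:k]))
    ideal = sorted(utility.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["A", "B", "C"]                   # model output for one turn
utility = {"A": 0.2, "B": 1.0, "C": 0.5}    # hypothetical latent utilities
print(reciprocal_rank(ranking, "B"))        # 0.5
print(hit_at_k(ranking, "B", 1))            # 0.0
print(hit_at_k(ranking, "B", 3))            # 1.0
print(round(undcg(ranking, utility), 3))    # 0.764
```

MRR and HR@K are then the means of `reciprocal_rank` and `hit_at_k` over all evaluated turns. The two families can disagree: a ranking that puts the user's actual pick first scores a perfect reciprocal rank even when that pick has low utility.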

Main Takeaways

There is a persistent trade-off between rational decision quality (uNDCG) and behavioral alignment (MRR); models rarely excel at both simultaneously.
Domain-specialized models (e.g., Llama3-XuanYuan3) tend to act as 'sycophants,' mimicking user noise rather than correcting for risk, leading to high MRR but lower utility scores.
General models (e.g., GPT-4o) are better at following rational utility principles but may fail to personalize to the specific user's behavioral idiosyncrasies.
Models often confuse long-term risk parameters with short-term market momentum, indicating a gap in true understanding of financial risk tolerance.

📚 Prerequisite Knowledge

Prerequisites

Financial Utility Theory (Return vs. Risk)
Recommender Systems metrics (NDCG, MRR)
Inverse Optimization
Modern Portfolio Theory basics (Alpha, Beta, Drawdown)

Key Terms

uNDCG: Utility-based Normalized Discounted Cumulative Gain—a metric measuring how well a ranking aligns with a user's theoretical utility function rather than just their historical choices.

Inverse Optimization: A method to infer the parameters of an optimization problem (here, the user's utility function) given the observed optimal solutions (the user's choices).

Drawdown: The peak-to-trough decline in an investment's value over a specified period, used as a measure of downside risk.
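As a quick illustration, maximum drawdown over a price series can be computed by tracking the running peak. This is the standard textbook calculation, not necessarily the exact variant used in the benchmark; the function name and sample prices are made up.

```python
def max_drawdown(prices):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = prices[0]
    worst = 0.0
    for p in prices:
        peak = max(peak, p)                      # highest price seen so far
        worst = max(worst, (peak - p) / peak)    # decline from that peak
    return worst


print(round(max_drawdown([100, 120, 90, 110, 80, 130]), 4))  # 0.3333
```

Here the worst decline runs from the peak of 120 to the later trough of 80, a drop of one third, even though the series ends at a new high.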

EAS: Expert Alignment Score—a metric using Kendall’s Tau to measure how closely a model's ranking matches a specific expert strategy (e.g., Momentum or Risk-Safety).
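For intuition, Kendall's Tau between a model ranking and an expert ranking can be computed by counting concordant and discordant item pairs. The sketch below implements the simple tau-a form for two rankings of the same items with no ties; whether the benchmark uses this variant or a tie-corrected one is not stated in this summary, and the item labels are placeholders.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


model = ["A", "B", "C", "D"]
momentum_expert = ["A", "C", "B", "D"]
print(round(kendall_tau(model, momentum_expert), 4))  # 0.6667
```

A value of 1 means the model reproduces the expert ordering exactly, and a value of -1 means it reverses it.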

GICS: Global Industry Classification Standard—an industry taxonomy used to categorize stocks.

Market Beta: A measure of the volatility—or systematic risk—of a security or portfolio compared to the market as a whole.

Multinomial Logit: A probabilistic model that assigns a probability to each of several discrete outcomes (for example, which stock a user picks) by applying a softmax to per-outcome scores.
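In the context of preference inference, a multinomial-logit likelihood and its regularized negative log-likelihood might be written as follows. The symbols c_t (the user's choice at step t), U_{i,t} (the utility of stock i at step t), and rho (the regularization weight), as well as the L2 form of the penalty, are assumptions introduced for illustration; S_t, T, lambda, and gamma follow the problem definition.

```latex
P(c_t = i \mid \lambda, \gamma)
  = \frac{\exp\big(U_{i,t}(\lambda, \gamma)\big)}
         {\sum_{j \in S_t} \exp\big(U_{j,t}(\lambda, \gamma)\big)},
\qquad
\mathcal{L}(\lambda, \gamma)
  = -\sum_{t=1}^{T} \log P(c_t \mid \lambda, \gamma)
    + \rho \left( \lambda^2 + \gamma^2 \right)
```

Minimizing this loss over lambda and gamma yields the risk parameters under which the observed choices are most probable, subject to the penalty.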