Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie
The Fin AI,
Georgia Institute of Technology,
Columbia University,
California State University,
Mohamed bin Zayed University of Artificial Intelligence,
University of Montreal,
McGill University
Conv-FinRe is a benchmark that evaluates financial LLM advisors not just by how well they mimic user choices, but by how well they align with the user's latent financial utility and risk tolerance over time.
Core Problem
Existing financial recommendation benchmarks rely on behavioral imitation (mimicking user clicks/trades), but in finance, user actions are often noisy, emotional, or short-sighted, making them a poor proxy for true decision quality.
Why it matters:
Faithful mimicking of noisy actions may align with bad financial habits rather than the user's long-term goals.
Current benchmarks cannot distinguish whether an LLM is reasoning rationally, blindly chasing market momentum, or overfitting to user idiosyncrasies.
Financial advisors must balance adhering to user instructions with providing normative guidance based on risk tolerance, a nuance missing from simple relevance-based evaluations.
Concrete Example: A user might panic and sell a solid stock during short-term volatility. A model trained only on behavioral imitation would recommend selling (matching the error), whereas a utility-grounded model should recognize the user's long-term risk profile and recommend holding or buying.
Key Novelty
Multi-View Utility-Grounded Evaluation
Evaluates recommendations against four distinct reference rankings: User Choice (empirical), Rational Utility (theoretical optimum), Market Momentum (trend-chasing), and Risk Sensitivity (safety-focused).
Uses Inverse Optimization to infer latent user risk parameters (sensitivity to volatility and drawdown) from longitudinal behavior, creating a 'ground truth' utility function that is hidden from the model.
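To make the Rational Utility view concrete, here is a minimal sketch of a utility-based reference ranking. It assumes a utility of the form mean return minus lam times return variance minus gamma times maximum drawdown, which follows the Mean-Variance-Drawdown name used in this summary; the paper's exact formula is not reproduced here. The function names, tickers, and prices are hypothetical.

```python
from statistics import mean, pvariance


def max_drawdown(prices):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = prices[0], 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst


def utility(prices, lam, gamma):
    """Assumed form: mean return - lam * variance - gamma * max drawdown."""
    returns = [b / a - 1.0 for a, b in zip(prices, prices[1:])]
    return mean(returns) - lam * pvariance(returns) - gamma * max_drawdown(prices)


def rational_ranking(price_history, lam, gamma):
    """Rank tickers from highest to lowest utility for one user's (lam, gamma)."""
    scores = {ticker: utility(p, lam, gamma) for ticker, p in price_history.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy price paths (invented for illustration only).
price_history = {
    "STEADY": [100, 101, 102, 103, 104],
    "SWINGY": [100, 120, 90, 125, 110],
}

# A risk-neutral user (lam = gamma = 0) ranks purely by mean return.
print(rational_ranking(price_history, lam=0.0, gamma=0.0))  # ['SWINGY', 'STEADY']
# A risk-averse user penalizes variance and drawdown, which flips the order.
print(rational_ranking(price_history, lam=5.0, gamma=1.0))  # ['STEADY', 'SWINGY']
```

The same candidate set yields different reference rankings for different users, which is why the inferred parameters matter for evaluation.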
Architecture
The Conv-FinRe pipeline: Data Collection (User Profiling, Asset Simulation), Conversation Simulation, and Multi-View Evaluation.
Evaluation Highlights
Reveals a tension between alignment and utility: Models like GPT-4o often achieve higher utility-based rankings (uNDCG) but lower behavioral alignment (MRR) compared to domain-specialized models.
Specialized financial models (e.g., Llama3-XuanYuan3-70B) tend to overfit noisy user actions, mistaking transient emotional decisions for stable preferences.
General-purpose models often conflate long-term risk management with short-term market momentum, performing well on momentum baselines but failing to capture specific risk sensitivities.
Breakthrough Assessment
8/10
Significant methodological shift from 'behavior-as-truth' to 'utility-as-truth' in recommender systems. The use of inverse optimization to construct latent ground truth is a novel approach to evaluating rationality vs. imitation.
⚙️ Technical Details
Problem Definition
Setting: Multi-view longitudinal stock recommendation, an iterative interaction between an advisor and a user over a fixed horizon T.
Inputs: User onboarding interview P, current market state M_t, and longitudinal interaction history H_{1:t-1} containing past dialogues and decisions.
Outputs: A ranked list of stocks pi_{i,t} from the candidate set S_t.
Pipeline Flow
User Profiling (Questionnaire & Onboarding)
Asset Simulation (Longitudinal Data Collection)
Preference Inference (Inverse Optimization)
Conversation Simulation (Dialogue Generation)
Multi-View Evaluation
System Modules
User Profiler (Data Generation)
Captures static user demographics, financial goals, and risk attitudes via structured questionnaires
Model or implementation: Rules/Scripts based on MiFID II/FINRA guidelines
Asset Simulator (Data Generation)
Collects longitudinal decision trajectories where users interact with stocks over 30 days
Model or implementation: Custom simulation tool (LetYourProfitsRun)
Preference Inferrer
Estimates user-specific risk parameters (lambda, gamma) from behavior using Inverse Optimization
Model or implementation: Regularized Negative Log-Likelihood Minimization
Dialogue Generator (Data Generation)
Converts profiles and traces into multi-turn advisory conversations
Model or implementation: LLM-based generator (implied; the paper does not detail the generator itself)
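The Preference Inferrer module listed above can be illustrated with a small sketch of regularized negative log-likelihood minimization. It assumes a multinomial-logit choice model over a mean-variance-drawdown utility, an L2 penalty, and plain projected gradient descent; the paper's exact likelihood, regularizer, and optimizer are not given in this summary, and every name and number below is hypothetical.

```python
import math


def choice_probs(options, lam, gamma):
    """Multinomial-logit probabilities over options given as (mu, var, dd)."""
    utils = [mu - lam * var - gamma * dd for mu, var, dd in options]
    top = max(utils)  # subtract the max for numerical stability
    weights = [math.exp(u - top) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]


def fit_risk_params(decisions, reg=0.01, lr=0.1, steps=2000):
    """Minimize L2-regularized NLL over (lam, gamma), keeping both >= 0."""
    lam = gamma = 0.0
    for _ in range(steps):
        # Gradient of the L2 penalty reg * (lam^2 + gamma^2).
        g_lam, g_gamma = 2 * reg * lam, 2 * reg * gamma
        for options, chosen in decisions:
            probs = choice_probs(options, lam, gamma)
            exp_var = sum(p * o[1] for p, o in zip(probs, options))
            exp_dd = sum(p * o[2] for p, o in zip(probs, options))
            # d(-log P(chosen)) / d lam = var_chosen - E[var]; same for gamma.
            g_lam += options[chosen][1] - exp_var
            g_gamma += options[chosen][2] - exp_dd
        lam = max(0.0, lam - lr * g_lam)
        gamma = max(0.0, gamma - lr * g_gamma)
    return lam, gamma


# Toy data: option 0 has higher return and higher variance than option 1;
# both share the same drawdown. The user picks the calmer option 3 times of 4.
options = [(1.0, 2.0, 0.5), (0.5, 0.5, 0.5)]
decisions = [(options, 1), (options, 1), (options, 1), (options, 0)]

lam, gamma = fit_risk_params(decisions)
print(f"lam={lam:.3f}, gamma={gamma:.3f}")
```

In this toy case the drawdowns are identical across options, so the choices carry no information about gamma and it stays at zero, while the repeated preference for the low-variance option pushes lam above zero.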
Novel Architectural Elements
Latent Preference Grounding via Inverse Optimization: Decoupling the evaluation 'ground truth' from the observed user behavior by inferring underlying utility parameters.
Modeling
Base Model: Llama-3.3-70B-Instruct
Compute: Not reported in the paper
Comparison to Prior Work
vs. FinMem/FinGPT: Conv-FinRe evaluates utility alignment via inverse optimization rather than just prediction accuracy or trading profit.
vs. PIE: Focuses on longitudinal decision quality and risk preferences rather than just explanation quality or single-turn relevance.
vs. General RecSys (e.g., MovieLens): Addresses non-stationary assets and objective utility (risk/return) rather than purely subjective taste.
Limitations
Small scale of human participants (10 users) limits the diversity of behavioral phenotypes.
Stock universe is restricted to 10 representative S&P 500 stocks, potentially simplifying market dynamics.
Reliance on Yahoo Finance data for the specific period (Aug-Sep 2025) may not generalize to different market regimes (e.g., crash or boom).
Assumes user behavior can be modeled by a specific utility function form (Mean-Variance-Drawdown), which may not capture all irrationalities.
Dataset publicly released on Hugging Face (TheFinAI/conv-finre). Codebase available on GitHub (The-FinAI/Conv-FinRe). Simulation tool available on Hugging Face Spaces. Raw market data from Yahoo Finance API. Specific prompts for the conversation generation are part of the codebase.
📊 Experiments & Results
Evaluation Setup
Conversational stock recommendation over a 30-day simulated horizon.
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
There is a persistent trade-off between rational decision quality (uNDCG) and behavioral alignment (MRR); models rarely excel at both simultaneously.
Domain-specialized models (e.g., Llama3-XuanYuan3) tend to act as 'sycophants,' mimicking user noise rather than correcting for risk, leading to high MRR but lower utility scores.
General models (e.g., GPT-4o) are better at following rational utility principles but may fail to personalize to the specific user's behavioral idiosyncrasies.
Models often confuse long-term risk parameters with short-term market momentum, indicating a gap in true understanding of financial risk tolerance.
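The trade-off between uNDCG and MRR described in these takeaways can be seen in a toy calculation. The sketch below assumes that uNDCG uses each stock's utility, shifted to be non-negative, as the gain in a standard NDCG formula, and that MRR averages the reciprocal rank of the stock the user actually chose; the paper's exact gain definition may differ, and the utilities and tickers are invented.

```python
import math


def reciprocal_rank(ranking, chosen):
    """1 / position of the user's chosen stock (0 if it is not ranked)."""
    return 1.0 / (ranking.index(chosen) + 1) if chosen in ranking else 0.0


def undcg(ranking, utilities, k=None):
    """NDCG with utility-derived gains (assumed gain: utility minus the minimum)."""
    k = k or len(ranking)
    low = min(utilities.values())
    gains = {stock: u - low for stock, u in utilities.items()}

    def dcg(order):
        return sum(gains[s] / math.log2(i + 2) for i, s in enumerate(order[:k]))

    ideal = dcg(sorted(gains, key=gains.get, reverse=True))
    return dcg(ranking) / ideal if ideal > 0 else 0.0


# Hypothetical latent utilities for one user; the user nevertheless chose "C".
utilities = {"A": 0.3, "B": 0.1, "C": -0.2}
user_choice = "C"

utility_first = ["A", "B", "C"]  # follows the latent utility
choice_first = ["C", "A", "B"]   # imitates the observed choice

for name, ranking in [("utility_first", utility_first), ("choice_first", choice_first)]:
    print(name, round(undcg(ranking, utilities), 3), round(reciprocal_rank(ranking, user_choice), 3))
```

The utility-following ranking scores a perfect uNDCG but a low reciprocal rank, while the imitating ranking does the opposite, mirroring the tension reported above.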
📚 Prerequisite Knowledge
Prerequisites
Financial Utility Theory (Return vs. Risk)
Recommender Systems metrics (NDCG, MRR)
Inverse Optimization
Modern Portfolio Theory basics (Alpha, Beta, Drawdown)
Key Terms
uNDCG: Utility-based Normalized Discounted Cumulative Gain—a metric measuring how well a ranking aligns with a user's theoretical utility function rather than just their historical choices.
Inverse Optimization: A method to infer the parameters of an optimization problem (here, the user's utility function) given the observed optimal solutions (the user's choices).
Drawdown: The peak-to-trough decline in an investment's value over a specified period, used as a measure of downside risk.
EAS: Expert Alignment Score—a metric using Kendall’s Tau to measure how closely a model's ranking matches a specific expert strategy (e.g., Momentum or Risk-Safety).
GICS: Global Industry Classification Standard—an industry taxonomy used to categorize stocks.
Market Beta: A measure of the volatility—or systematic risk—of a security or portfolio compared to the market as a whole.
Multinomial Logit: A probabilistic model used to predict the probabilities of different possible outcomes of a categorically distributed dependent variable.
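As an illustration of the Kendall's Tau computation behind EAS, the following sketch compares a model ranking with an expert ranking over the same stocks. It covers only the tie-free tau itself; how the paper aggregates or rescales tau into the final EAS is not stated in this summary, and the tickers are placeholders.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau for two tie-free rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


momentum_expert = ["S1", "S2", "S3", "S4"]

print(kendall_tau(["S1", "S2", "S3", "S4"], momentum_expert))  # 1.0
print(kendall_tau(["S4", "S3", "S2", "S1"], momentum_expert))  # -1.0
print(kendall_tau(["S2", "S1", "S3", "S4"], momentum_expert))  # one swapped pair
```

A value near 1 means the model orders stocks almost exactly like the expert strategy, and a value near -1 means it orders them in reverse.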