← Back to Paper List

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie
The Fin AI, Georgia Institute of Technology, Columbia University, California State University, Mohamed bin Zayed University of Artificial Intelligence, University of Montreal, McGill University
arXiv (2026)
Benchmark Recommendation P13N Memory

📝 Paper Summary

Financial Recommendation Conversational AI User Modeling
Conv-FinRe is a benchmark that evaluates financial LLM advisors not just by how well they mimic user choices, but by how well they align with the user's latent financial utility and risk tolerance over time.
Core Problem
Existing financial recommendation benchmarks rely on behavioral imitation (mimicking user clicks/trades), but in finance, user actions are often noisy, emotional, or short-sighted, making them a poor proxy for true decision quality.
Why it matters:
  • Faithful mimicking of noisy actions may align with bad financial habits rather than the user's long-term goals.
  • Current benchmarks cannot distinguish whether an LLM is reasoning rationally, blindly chasing market momentum, or overfitting to user idiosyncrasies.
  • Financial advisors must balance adhering to user instructions with providing normative guidance based on risk tolerance, a nuance missing from simple relevance-based evaluations.
Concrete Example: A user might panic and sell a solid stock during short-term volatility. A model trained only on behavioral imitation would recommend selling (matching the error), whereas a utility-grounded model should recognize the user's long-term risk profile and recommend holding or buying.
Key Novelty
Multi-View Utility-Grounded Evaluation
  • Evaluates recommendations against four distinct reference rankings: User Choice (empirical), Rational Utility (theoretical optimum), Market Momentum (trend-chasing), and Risk Sensitivity (safety-focused).
  • Uses Inverse Optimization to infer latent user risk parameters (sensitivity to volatility and drawdown) from longitudinal behavior, creating a 'ground truth' utility function that is hidden from the model.
Architecture
Architecture Figure Figure 1
The Conv-FinRe pipeline: Data Collection (User Profiling, Asset Simulation), Conversation Simulation, and Multi-View Evaluation.
Evaluation Highlights
  • Reveals a tension between alignment and utility: Models like GPT-4o often achieve higher utility-based rankings (uNDCG) but lower behavioral alignment (MRR) compared to domain-specialized models.
  • Specialized financial models (e.g., Llama3-XuanYuan3-70B) tend to overfit noisy user actions, mistaking transient emotional decisions for stable preferences.
  • General-purpose models often conflate long-term risk management with short-term market momentum, performing well on momentum baselines but failing to capture specific risk sensitivities.
Breakthrough Assessment
8/10
Significant methodological shift from 'behavior-as-truth' to 'utility-as-truth' in recommender systems. The use of inverse optimization to construct latent ground truth is a novel approach to evaluating rationality vs. imitation.
×