Evaluation Setup
Task: user rating prediction (1-5 stars) and reasoning quality assessment
Benchmarks:
- Not explicitly named in the text (the task is user rating prediction)
Metrics:
- BLEU (for faithfulness)
- ROUGE (for faithfulness)
- METEOR (for coherence)
- BERTScore (for coherence)
- Human judgment (Likert-scale ratings of Coherence, Faithfulness, and Insightfulness)
- Statistical methodology: Not explicitly reported in the paper
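To make the overlap-based metrics above concrete, here is a minimal sketch of a ROUGE-1-style score: the F1 of clipped unigram overlap between a generated reasoning text and a reference. It is a simplified stand-in, not the paper's implementation. The function name, the whitespace tokenization, and the two example strings are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 from clipped unigram overlap (illustrative only)."""
    # Assumption: lowercase whitespace tokenization; real toolkits
    # typically apply stemming and more careful tokenization.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical texts, not taken from the paper.
generated = "the user likes fast delivery and gave high ratings before"
reference = "the user likes fast delivery"
score = unigram_overlap_f1(generated, reference)
print(f"unigram overlap F1: {score:.3f}")
```

BLEU adds higher-order n-grams and a brevity penalty, and BERTScore replaces exact token matches with embedding similarity, so this sketch only conveys the general shape of such reference-based scores.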
Main Takeaways
- Incorporating chain-of-thought (CoT) reasoning into recommender systems (RecSys) improves performance on personalized tasks in both zero-shot and fine-tuning settings (qualitative claim; no numbers are given in the text).
- Fine-tuning smaller models on reasoning data generated by larger models improves their performance (distillation).
- Syntactic metrics based on n-gram overlap (BLEU, ROUGE) are suitable proxies for assessing the 'Faithfulness' of LLM reasoning in RecSys.
- METEOR and BERTScore are well suited to measuring the 'Coherence' of generated reasoning.
- The Rec-SAVER framework's assessments align with human judgment, allowing cost-effective evaluation without gold references.
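One common way to check whether an automatic metric agrees with human Likert ratings is a rank correlation. The sketch below computes Spearman's coefficient in plain Python. This summary does not report the paper's statistical methodology, so the choice of Spearman correlation is an assumption, and the function names and sample scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: automatic metric scores and human Likert ratings
# for the same four reasoning outputs (not results from the paper).
metric_scores = [0.2, 0.5, 0.5, 0.9]
human_ratings = [2, 3, 4, 5]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho: {rho:.3f}")
```

A coefficient near 1 would mean the metric orders outputs much as human raters do. Likert data contains many ties, which is why the sketch uses average ranks instead of the simpler tie-free formula.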