Evaluation Setup
Manual qualitative evaluation of chatbot responses to real-world questions from Infineon's developer community
Benchmarks:
- Infineon Developer Community Questions (domain-specific QA: technical, sales, customer support) [New]
Metrics:
- Accuracy (human evaluated)
- Relevance (human evaluated)
- Comprehensiveness (human evaluated)
- Runtime (seconds)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Latency comparison shows RAG-Fusion is significantly slower than standard RAG due to additional processing steps.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Runtime Comparison | Average Query-to-Output Time (seconds) | 19.52 | 34.62 | +15.10 |
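
The runtime figures above can be turned into a relative slowdown with a few lines of arithmetic. The sketch below uses only the two averages reported in the table (19.52 s for the baseline, 34.62 s for RAG-Fusion); the variable names are illustrative and do not come from the paper.

```python
# Average query-to-output times reported in the runtime comparison (seconds).
baseline_s = 19.52      # standard RAG
rag_fusion_s = 34.62    # RAG-Fusion

# Absolute difference, rounded to the precision of the reported values.
delta_s = round(rag_fusion_s - baseline_s, 2)

# Relative measures derived from the same two numbers.
slowdown = rag_fusion_s / baseline_s
percent_increase = 100 * delta_s / baseline_s

print(f"delta    = {delta_s:+.2f} s")            # +15.10 s
print(f"slowdown = {slowdown:.2f}x")             # 1.77x
print(f"increase = {percent_increase:.1f}%")     # 77.4%
```

Under these numbers, RAG-Fusion takes roughly 1.77 times as long as the baseline per query, an increase of about 77%.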
Main Takeaways
- RAG-Fusion provides more comprehensive answers than human experts for certain technical questions by proactively explaining related concepts (e.g., explaining IP ratings rather than just stating them).
- The method successfully generates sales strategies by combining technical datasheet info with general sales logic.
- A major trade-off is latency: RAG-Fusion takes about 1.77 times as long as standard RAG per query (34.62 s vs. 19.52 s on average), primarily due to the second API call to the LLM.
- The system struggles with questions whose correct answer is negative (e.g., 'Does X have sleep mode?'), often defaulting to 'uncertain' rather than a definitive 'no' when the documents lack the specific keyword.
- Users sometimes need to engineer their prompts to keep the query generator from misinterpreting intent (e.g., confusing 'good for a camera' with 'is a camera').
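
The latency takeaway can be illustrated with a small timing harness. This is a minimal sketch, not the paper's implementation: `CountingLLM`, `retrieve`, `standard_rag`, `rag_fusion`, and `run_timed` are hypothetical names, the LLM is a local stub that makes no network calls, and the fusion/reranking of retrieved documents is omitted. It only shows where the extra LLM call sits and how query-to-output time could be measured.

```python
import time


class CountingLLM:
    """Hypothetical stand-in for a hosted LLM; counts calls, no network access."""

    def __init__(self):
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        return f"response to: {prompt[:40]}"


def retrieve(query):
    # Placeholder retriever: a real system would search a document index.
    return [f"document matching '{query}'"]


def standard_rag(llm, question):
    # Retrieve once, then make a single LLM call to generate the answer.
    docs = retrieve(question)
    return llm.complete(f"Answer '{question}' using: {docs}")


def rag_fusion(llm, question, n_queries=4):
    # LLM call 1: ask the model for several search queries.
    llm.complete(f"Write {n_queries} search queries for: {question}")
    # Assumption: the stub's reply is not parsed; variants are faked here.
    queries = [f"{question} (variant {i + 1})" for i in range(n_queries)]
    docs = [doc for q in queries for doc in retrieve(q)]
    # LLM call 2: generate the final answer from the combined documents.
    return llm.complete(f"Answer '{question}' using: {docs}")


def run_timed(pipeline, question):
    """Return (number of LLM calls, elapsed seconds) for one query."""
    llm = CountingLLM()
    start = time.perf_counter()
    pipeline(llm, question)
    elapsed = time.perf_counter() - start
    return llm.calls, elapsed


if __name__ == "__main__":
    for name, pipeline in [("standard RAG", standard_rag), ("RAG-Fusion", rag_fusion)]:
        calls, elapsed = run_timed(pipeline, "Does X have sleep mode?")
        print(f"{name}: {calls} LLM call(s), {elapsed:.6f} s")
```

With a stub the elapsed times are negligible; against a real API, the second call and the additional retrievals are where the extra time would be expected to accumulate.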