Evaluation Setup
Evaluation on standard QA and reasoning benchmarks using open-source models augmented with ODS
Benchmarks:
- SimpleQA (Factuality and short-answer QA)
- FRAMES (Multi-hop reasoning and information retrieval)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
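As a minimal sketch of the accuracy metric listed above, assuming each answer has already been graded as correct or incorrect (the grading procedure is not described in this summary, and the function name `accuracy` is hypothetical):

```python
def accuracy(graded):
    """Return the percentage of answers marked correct.

    `graded` is an iterable of booleans, one per benchmark question.
    How each answer is judged correct is an assumption left to the caller.
    """
    graded = list(graded)
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(1 for g in graded if g) / len(graded)


# Illustrative input only, not data from the paper: 3 of 4 answers correct.
print(accuracy([True, True, False, True]))
```

The result is expressed in percentage points so that it is on the same scale as the figures reported under Key Results.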
Key Results
ODS variants (v1/v2) combined with DeepSeek-R1 consistently outperform or match proprietary state-of-the-art models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FRAMES | Accuracy | 65.6 | 75.3 | +9.7 |
| SimpleQA | Accuracy | 82.2 | 88.3 | +6.1 |
| SimpleQA | Accuracy | 82.4 | 88.3 | +5.9 |
| FRAMES | Accuracy | 30.1 | 75.3 | +45.2 |
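The Δ column is the reported accuracy minus the baseline accuracy, in percentage points. The sketch below recomputes it from the table values; the names `ROWS` and `deltas` are hypothetical, and `Decimal` is used only to avoid floating-point rounding noise in one-decimal figures.

```python
from decimal import Decimal

# (benchmark, baseline accuracy, reported accuracy), copied from the table.
ROWS = [
    ("FRAMES", "65.6", "75.3"),
    ("SimpleQA", "82.2", "88.3"),
    ("SimpleQA", "82.4", "88.3"),
    ("FRAMES", "30.1", "75.3"),
]


def deltas(rows):
    """Return (benchmark, reported - baseline) pairs in percentage points."""
    return [(name, Decimal(ours) - Decimal(base)) for name, base, ours in rows]


for name, delta in deltas(ROWS):
    print(f"{name}: {delta:+} points")
```

Each computed difference matches the corresponding Δ entry in the table.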
Main Takeaways
- ODS-v2 (CodeAct) generally outperforms ODS-v1 (ReAct), with higher accuracy on both SimpleQA and FRAMES.
- Combining ODS with strong reasoning models like DeepSeek-R1 yields performance exceeding current proprietary leaders (GPT-4o Search, Perplexity Sonar).
- The Open Search Tool, which adds rephrasing and scraping, provides significantly higher-quality context than the raw SERP injection used in prior open-source tools.