Evaluation Setup
Task: multi-step QA with heterogeneous retrieval tools (web search, browser, local search, knowledge graph)
Benchmarks:
- HotpotQA (Multi-hop QA)
- 2WikiMultihopQA (Multi-hop QA)
- Bamboogle (Compositional two-hop QA)
Metrics:
- Answer Accuracy (EM/F1)
- Citation Precision/Recall
- Tool-Use Validity
Statistical methodology: Not explicitly reported in the paper
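The answer-accuracy and citation metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the normalization follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles), and the set-based citation precision/recall is an assumed definition, since the summary does not state how the paper computes it. All function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import Set, Tuple


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def citation_precision_recall(cited: Set[str], supporting: Set[str]) -> Tuple[float, float]:
    """Assumed definition: overlap between cited and gold supporting document IDs."""
    hits = len(cited & supporting)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(supporting) if supporting else 0.0
    return precision, recall
```

Under these assumptions, a prediction of "Paris, France" against the gold answer "Paris" scores 0 on EM but receives partial credit on F1, which is why the two are usually reported together.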
Key Results
PoU consistently outperforms baselines on standard multi-hop QA benchmarks, showing that constraining agents with evidence protocols does not degrade performance but rather enhances it.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HotpotQA | EM | Not reported in the paper | Not reported in the paper | Positive qualitative gain |
Main Takeaways
- PoU successfully suppresses Tool-Call Hacking: agents trained with PoU do not collapse into single-tool overuse or decorative citations.
- The adaptive reward mixing strategy is crucial for stabilizing training, allowing the agent to graduate from dense process rewards to sparse outcome rewards.
- Emergent robustness: PoU agents adapt better to domain shifts and tool changes than agents trained with simple outcome supervision, despite not being explicitly optimized for transfer.
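The move from dense process rewards to sparse outcome rewards described in the takeaways can be illustrated with a simple mixing schedule. The sketch below is an assumption made for illustration only: the summary does not give the paper's adaptive rule, so a plain linear anneal on the training step stands in for it. The names `process_weight`, `mixed_reward`, and `anneal_steps` are hypothetical.

```python
def process_weight(step: int, anneal_steps: int, floor: float = 0.0) -> float:
    """Weight on the dense process reward, decayed linearly from 1.0 to `floor`.

    Assumption: a linear schedule replaces the paper's adaptive rule,
    which is not specified in this summary.
    """
    if anneal_steps <= 0:
        raise ValueError("anneal_steps must be positive")
    progress = min(max(step / anneal_steps, 0.0), 1.0)
    return max(1.0 - progress, floor)


def mixed_reward(process_reward: float, outcome_reward: float,
                 step: int, anneal_steps: int) -> float:
    """Convex combination of process and outcome rewards at a given step."""
    lam = process_weight(step, anneal_steps)
    return lam * process_reward + (1.0 - lam) * outcome_reward


if __name__ == "__main__":
    # Early in training the process reward dominates; late, the outcome reward does.
    for step in (0, 50, 100):
        print(step, mixed_reward(process_reward=0.8, outcome_reward=0.0,
                                 step=step, anneal_steps=100))
```

A truly adaptive variant would tie the weight to a training signal, such as a running outcome success rate, instead of the step count. That choice is left open here because the source does not specify it.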