| Benchmark | Metric | Baseline (ClaudeCode) | This Paper (TraceSIR) | Δ |
|---|---|---|---|---|
| Human evaluation results showing TraceSIR's improvement over ClaudeCode across three domain scenarios. | ||||
| TraceBench (Deep Research) | ReportEval Score | Normalized Base | Base + 10.0% | +10.0% |
| TraceBench (Function Calling) | ReportEval Score | Normalized Base | Base + 13.0% | +13.0% |
| TraceBench (Agentic Coding) | ReportEval Score | Normalized Base | Base + 5.0% | +5.0% |
| LLM-as-a-judge evaluation results showing consistent trends with human evaluation. | ||||
| TraceBench (Agentic Coding) | ReportEval Score | Normalized Base | Base + 26.0% | +26.0% |
| TraceBench (Average) | ReportEval Score | Normalized Base | Base + 7.5% | +7.5% |
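
The Δ column can be read as a relative gain over a baseline normalized to 1.0. The sketch below illustrates that arithmetic for the human-evaluation rows. It assumes "Base + 10.0%" denotes a multiplicative improvement (1.10 × baseline) rather than absolute percentage points, which the table does not state explicitly. The function name `relative_gain_percent` and the normalized scores are hypothetical, introduced only for illustration.

```python
def relative_gain_percent(baseline: float, score: float) -> float:
    """Return the improvement of `score` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


# Assumption: the baseline is normalized to 1.0, so "Base + 10.0%" becomes 1.10.
BASELINE = 1.0
human_eval_scores = {
    "Deep Research": 1.10,
    "Function Calling": 1.13,
    "Agentic Coding": 1.05,
}

# Round to one decimal place to match the precision reported in the table.
gains = {
    scenario: round(relative_gain_percent(BASELINE, score), 1)
    for scenario, score in human_eval_scores.items()
}

for scenario, gain in gains.items():
    print(f"{scenario}: +{gain:.1f}%")
```

Under this reading, the Δ values do not depend on the ReportEval scale, so rows are comparable even when raw scores are not reported.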