Evaluation Setup
Dynamic benchmarking where the system searches for errors in a target LLM using Wikipedia as a knowledge source.
Benchmarks:
- Wikipedia Knowledge Base (Factuality / Question Answering) [New]
Metrics:
- Number of discovered errors
- Error Rate (proportion of questions answered incorrectly)
- Cost per Error (API calls or budget unit per error found)
- Statistical methodology: Not explicitly reported in the paper
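
The error rate and cost-per-error metrics listed above can be illustrated with a small sketch. The `RunSummary` class, its field names, and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper, and treating cost as the number of API calls is an assumption (the summary also allows "budget units").

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical record of one error-discovery run (illustrative only)."""
    questions_asked: int  # questions posed to the target LLM
    errors_found: int     # questions the target answered incorrectly
    api_calls: int        # assumed cost unit: one API call = one unit

    @property
    def error_rate(self) -> float:
        # Proportion of questions answered incorrectly.
        if self.questions_asked == 0:
            return 0.0
        return self.errors_found / self.questions_asked

    @property
    def cost_per_error(self) -> float:
        # Budget spent per discovered error; infinite if none were found.
        if self.errors_found == 0:
            return float("inf")
        return self.api_calls / self.errors_found


# Made-up numbers, not results from the paper.
run = RunSummary(questions_asked=200, errors_found=50, api_calls=400)
print(run.error_rate)      # 0.25
print(run.cost_per_error)  # 8.0
```

A lower cost per error means the search spends less budget for each failure it uncovers, which is why this number is reported alongside the raw error count.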
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison with Automated Capability Discovery (ACD) shows massive improvements in total errors found under fixed budget. | | | | |
| Wikipedia (Dynamic) | Number of Errors (Ratio SEA/ACD) | 1.0 | 55.83 | +54.83 |
| Wikipedia (Dynamic) | Cost per Error Reduction | 1.0 | 599.0 | +598.0 |
| Comparison with AutoBencher on error rate efficiency. | | | | |
| Wikipedia (Dynamic) | Error Rate | 0.26 | 0.42 | +0.16 |
| Wikipedia (Dynamic) | Average Error Rate | 0.30 | 0.38 | +0.08 |
| Wikipedia (Dynamic) | Cost per Error Reduction | 1.0 | 9.0 | +8.0 |
Main Takeaways
- SEA consistently discovers more errors than baselines by actively following error gradients via semantic similarity.
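- The method is highly cost-effective, reducing the number of queries needed to find a given number of failures: the reported cost-per-error reduction is 599.0 relative to ACD and 9.0 relative to AutoBencher (baselines set to 1.0).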
- Error analysis reveals strong intra-family correlations (e.g., GPT-4o models share failure patterns), but o1-mini behaves differently.
- Models like DeepSeek-V3 struggle on subsets where GPT-4o performs well, highlighting model-specific knowledge gaps.
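
One way to quantify the shared failure patterns mentioned in the takeaways is to correlate per-question error indicators between pairs of models. The sketch below is an assumption about how such an analysis could be done, not the paper's procedure: it computes the Pearson correlation of two 0/1 error vectors (the phi coefficient). The function name `error_correlation`, the model labels, and the data are all hypothetical.

```python
from math import sqrt


def error_correlation(a: list[int], b: list[int]) -> float:
    """Pearson correlation of two equal-length 0/1 error vectors.

    For binary data this equals the phi coefficient. Returns 0.0 when
    either vector is constant, since the correlation is then undefined.
    """
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Made-up indicators: 1 = the model answered that question incorrectly.
errors = {
    "model_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "model_b": [1, 1, 0, 0, 1, 0, 0, 0],  # fails on mostly the same questions as model_a
    "model_c": [0, 0, 1, 1, 0, 1, 0, 1],  # fails exactly where model_a succeeds
}

same_family = error_correlation(errors["model_a"], errors["model_b"])
different = error_correlation(errors["model_a"], errors["model_c"])
print(f"{same_family:.2f} {different:.2f}")
```

A value near 1 indicates that two models tend to fail on the same questions, while a value near 0 or below indicates largely distinct knowledge gaps.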