Evaluation Setup
Genomics Question Answering on the GeneTuring benchmark
Benchmarks:
- GeneTuring (genomics QA covering nomenclature, location, functional analysis, and alignment tasks)
Metrics:
- Accuracy (average score, 0-100%; the Key Results table reports it as a fraction between 0 and 1)
- Computational Cost (Estimated via token counts)
- Efficiency (Inference time/resources)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Comparison of the NBA agentic framework against state-of-the-art baselines on the GeneTuring benchmark.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GeneTuring | Accuracy | 0.83 | 0.98 | +0.15 |
| GeneTuring | Accuracy | 0.44 | 0.98 | +0.54 |
| GeneTuring | Accuracy | 0.83 | 0.85 | +0.02 |
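The Δ column is the plain difference between the two accuracy figures in each row. The following minimal Python sketch recomputes it from the three reported rows. The row labels (`row_1` to `row_3`) and the function name `accuracy_delta` are hypothetical, since the table does not name the individual baselines or configurations.

```python
# Recompute the Δ column from the reported accuracy values.
# Row labels are placeholders: the table does not identify each baseline.
reported_rows = [
    ("row_1", 0.83, 0.98),
    ("row_2", 0.44, 0.98),
    ("row_3", 0.83, 0.85),
]


def accuracy_delta(baseline: float, this_paper: float) -> float:
    """Return the absolute accuracy gain, rounded to two decimals."""
    return round(this_paper - baseline, 2)


deltas = {label: accuracy_delta(b, p) for label, b, p in reported_rows}

for label, delta in deltas.items():
    # Accuracy is a fraction here, so multiply by 100 for percentage points.
    print(f"{label}: {delta:+.2f} ({delta * 100:+.0f} percentage points)")
```

Rounding to two decimals avoids floating-point noise in the subtraction, so the recomputed values can be compared directly against the Δ column.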
Main Takeaways
- Small Language Models (3-10B parameters) can achieve SOTA performance (85-97%) when wrapped in a modular agentic framework, negating the need for 100B+ models.
- The agentic 'Divide and Conquer' architecture prevents the accuracy degradation typically seen when scaling down models in monolithic prompting setups like GeneGPT.
- The approach generalizes across diverse model families (Llama, Mistral, Qwen, etc.), which supports the robustness of the architectural design.
- Local inference is viable, offering privacy benefits for clinical genomics data.