Evaluation Setup
Comprehensive evaluation across text-based, image-based, and retrieval tasks using the MM-Telco benchmark
Benchmarks:
- MM-Telco (Text) (MCQs, Multihop MCQs, Long-Answer QA, RAG) [New]
- MM-Telco (Image) (Image Classification, Image Retrieval, Image Captioning, Image Generation/Correction) [New]
- MM-Telco (PCAP) (Network troubleshooting via packet capture analysis) [New]
Metrics:
- SEM score (Cosine similarity)
- Retrieval Accuracy (Top-K)
- Classification Accuracy
- LLM-as-a-judge scores (0-100)
- Statistical methodology: Not explicitly reported in the paper
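As a rough illustration of the first two metrics above, the sketch below computes cosine similarity between two vectors and Top-K retrieval accuracy over a set of queries. It assumes the SEM score compares embedding vectors of a generated and a reference answer; the embedding model is not specified here, and the function names `cosine_similarity` and `top_k_accuracy` are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_accuracy(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold item appears among the first k results.

    ranked_ids: one ranked list of candidate IDs per query.
    gold_ids:   the single correct ID for each query (an assumption here).
    """
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Toy data, invented for illustration only.
rankings = [["img_a", "img_b"], ["img_c", "img_d"]]
gold = ["img_b", "img_d"]
acc_at_1 = top_k_accuracy(rankings, gold, k=1)
acc_at_2 = top_k_accuracy(rankings, gold, k=2)
```

With the toy rankings, neither gold item is ranked first, so accuracy at K=1 is 0 while accuracy at K=2 is 1.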
Key Results
The paper constructs a large-scale benchmark dataset. The following entries represent the scale and diversity of the constructed data, as performance results are not included in the provided text.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MM-Telco (Multihop MCQ) | Number of Samples | 0 | 2000 | +2000 |
| MM-Telco (PCAP Analysis) | Number of Samples | 0 | 500 | +500 |
| MM-Telco (Image Retrieval) | Number of Images | 0 | 3766 | +3766 |
| MM-Telco (Long Answer) | Number of Samples | 0 | 1500 | +1500 |
| MM-Telco (Named Entity) | Number of Entities | 0 | 1000 | +1000 |
Main Takeaways
- Constructed a structured knowledge graph from 3GPP Release 17 to preserve semantic continuity and cross-references often lost in naive chunking
- Identified that general-purpose LLMs struggle with distinguishing between 3GPP releases, motivating the need for this specialized benchmark
- Developed a novel task for Telecom Image Generation/Correction using Mermaid.js code, addressing the specific need for accurate network diagramming
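To make the Mermaid.js-based diagram task more concrete, here is a minimal sketch of how a network diagram can be expressed as Mermaid flowchart text. The helper `to_mermaid` and the example topology are assumptions made for illustration; they do not reproduce the paper's pipeline or data.

```python
def to_mermaid(edges, direction="LR"):
    """Render (source, label, target) triples as Mermaid flowchart text."""
    lines = [f"flowchart {direction}"]
    for src, label, dst in edges:
        if label:
            # Labelled edge, e.g. the name of the interface between two nodes.
            lines.append(f"    {src} -->|{label}| {dst}")
        else:
            lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


# Hypothetical 5G topology used only as sample input.
edges = [
    ("UE", "Uu", "gNB"),
    ("gNB", "N2", "AMF"),
    ("gNB", "N3", "UPF"),
]
diagram = to_mermaid(edges)
print(diagram)
```

Because the diagram is plain text, a correction task can in principle be framed as editing this text (for example, fixing a wrong interface label) instead of editing pixels.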