Evaluation Setup
Zero-shot evaluation on multilingual and cross-lingual embedding tasks using models trained only on English data.
Benchmarks:
- LUSIFER Benchmark: a comprehensive multilingual embedding suite covering Classification, Clustering, Reranking, Retrieval, and STS [New]
- Cross-lingual Benchmark: cross-lingual Retrieval and STS (Belebele, MLQA, STS17, STS22, IndicCrosslingual)
Metrics:
- Accuracy (Classification)
- V-measure (Clustering)
- nDCG@10 (Retrieval)
- Pearson correlation (STS)
- MAP (Reranking)
- Statistical significance testing: not explicitly reported in the paper
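The retrieval and STS metrics above can be sketched in a few lines of pure Python. This is a generic illustration of how nDCG@k and Pearson correlation are computed, not the paper's actual evaluation code:

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k for one query; ranked_rels are the graded relevance labels
    of the retrieved documents in ranked order (standard log2 discount)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

def pearson(x, y):
    """Pearson correlation between predicted and gold STS scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(ndcg_at_k([1, 1, 1]))           # a perfect ranking scores 1.0
print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly linear pairs score ~1.0
```

Per-query nDCG@10 scores are averaged over all queries of a retrieval task; likewise, STS correlation is computed over all sentence pairs of a dataset.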
Key Results
Main multilingual performance results across 14 languages, showing LUSIFER's superiority over English-centric and even some multilingual baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LUSIFER Benchmark (Average) | Average Score | 59.44 | 62.63 | +3.19 |
| LUSIFER Benchmark (Telugu) | Average Score | Not explicitly reported in the paper | Not explicitly reported in the paper | +22.15 |

Cross-lingual evaluation results showing strong transfer capabilities.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Cross-lingual Benchmark (Average) | Average Score | 52.14 | 57.89 | +5.75 |
| IndicCrosslingual | Score | 21.92 | 43.40 | +21.48 |
| MLQA Retrieval | nDCG@10 | 31.54 | 36.68 | +5.14 |

Ablation studies validating the architecture choices.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LUSIFER Benchmark | Average Score | 44.18 | 62.63 | +18.45 |
| LUSIFER Benchmark | Average Score | 56.74 | 62.63 | +5.89 |
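As a quick consistency check, each Δ value is simply the difference between the paper's score and the baseline score reported above; for example:

```python
# Sanity-check the Δ column against the reported averages.
rows = {
    "LUSIFER Benchmark (Average)":       (59.44, 62.63),
    "Cross-lingual Benchmark (Average)": (52.14, 57.89),
    "IndicCrosslingual":                 (21.92, 43.40),
}
for name, (baseline, ours) in rows.items():
    print(f"{name}: delta = {ours - baseline:+.2f}")
```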
Main Takeaways
- Achieves state-of-the-art performance in 10 out of 14 evaluated languages, surpassing baselines that use proprietary synthetic data.
- The method is highly effective for low-resource languages (e.g., Telugu, Swahili) where English-centric models fail completely.
- Outperforms fully supervised multilingual models like BGE-M3 on several tasks without using any multilingual training data, validating the zero-shot alignment hypothesis.
- Two-stage training (Alignment -> Finetuning) is critical; skipping alignment or representation finetuning leads to significant performance drops.