ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering

📝 Paper Summary

Domain-specific RAG (Music), Benchmark creation

The authors introduce a specialized music vector database (MusWikiDB) and a globally diverse artist benchmark (ArtistMus) for retrieval-augmented question answering in the music domain; models that retrieve from MusWikiDB substantially outperform the same models without retrieval.

Core Problem

General-purpose LLMs lack specific music knowledge, leading to hallucinations on artist details, while existing music benchmarks focus on theory or audio rather than the biographical and historical metadata users actually query.

Why it matters:

LLMs frequently fail or hallucinate when asked about specific artist careers, discographies, or influences due to sparse pre-training data.
Traditional fine-tuning is computationally expensive and struggles to keep up with dynamic music information (new albums, milestones).
Existing resources are Western-centric and lack the structured metadata needed to evaluate artist-centric reasoning across global regions.

Concrete Example: When asked 'Which jazz artists influenced Thad Jones’s move to Copenhagen?', a standard LLM often gives a generic or hallucinated response, whereas the proposed system retrieves specific biographical passages and answers from them.

Key Novelty

MusWikiDB and ArtistMus Framework

MusWikiDB: A specialized vector database of 3.2M passages derived from 144K music-specific Wikipedia pages, optimized for music retrieval rather than general knowledge.
ArtistMus: A benchmark of 1,000 QA pairs covering 500 artists from 163 countries, balancing global representation to correct Western bias.
Validates RAG-style fine-tuning on (context, question, answer) triples to improve both factual recall and contextual reasoning.

Architecture

The construction process of MusWikiDB and the RAG inference pipeline.

Evaluation Highlights

Open-source models gain up to +56.8 percentage points in factual accuracy using RAG (Qwen3 8B: 35.0% → 91.8%).
MusWikiDB retrieval yields +6 percentage points higher accuracy and 40% faster retrieval than using the general Wikipedia corpus.
RAG-style fine-tuning on Llama 3.1 8B improves factual accuracy by +46.4 pp and outperforms standard QA fine-tuning on contextual reasoning.

Breakthrough Assessment

8/10

Addresses a significant gap in domain-specific RAG by providing both a music-focused retrieval corpus with dense domain coverage and a culturally diverse benchmark. The large gains in factual accuracy (up to +56.8 pp) support the case for domain-specific indexes.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Question Answering in the Music Domain

Inputs: Natural language question q about music artists

Outputs: Selected answer option (A, B, C, or D)
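
The input/output contract above can be pictured as a small record type. This is only a minimal sketch: the class name `MCQItem`, its field names, and the sample question are illustrative assumptions and are not taken from the ArtistMus release.

```python
from dataclasses import dataclass

OPTION_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (hypothetical schema, not the official one)."""
    question: str
    options: tuple  # exactly four answer strings, in A-D order
    answer: str     # gold label, one of "A".."D"

    def __post_init__(self):
        if len(self.options) != len(OPTION_LABELS):
            raise ValueError("expected exactly four options")
        if self.answer not in OPTION_LABELS:
            raise ValueError("answer must be one of A, B, C, D")

    def render(self) -> str:
        """Format the question and its labelled options as prompt text."""
        lines = [self.question]
        lines += [f"{label}. {text}" for label, text in zip(OPTION_LABELS, self.options)]
        return "\n".join(lines)


# Made-up item, for illustration only.
item = MCQItem(
    question="Which instrument is the artist best known for?",
    options=("Trumpet", "Piano", "Drums", "Violin"),
    answer="A",
)
print(item.render())
```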

Pipeline Flow

Query Formulation
Retrieval (BM25 over MusWikiDB)
Reranking (BGE Reranker)
Generation (LLM)
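
The four stages above can be sketched as a function with pluggable components. Everything here is an assumption made for illustration: the function name `answer_with_rag`, the prompt wording, the cut-offs `k_retrieve` and `k_context`, and the toy stand-ins for the retriever, reranker, and LLM. The query is passed through unchanged, since the summary does not describe how queries are formulated.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]          # (query, k) -> candidate passages
Reranker = Callable[[str, List[str]], List[float]]   # (query, passages) -> relevance scores
Generator = Callable[[str], str]                     # prompt -> model output


def answer_with_rag(question: str, retrieve: Retriever, rerank: Reranker,
                    generate: Generator, k_retrieve: int = 20, k_context: int = 3) -> str:
    """Run retrieve -> rerank -> generate and return the generator's output."""
    candidates = retrieve(question, k_retrieve)
    scores = rerank(question, candidates)
    # Keep the k_context passages with the highest reranker scores.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    context = "\n\n".join(passage for _, passage in ranked[:k_context])
    prompt = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer with A, B, C, or D."
    return generate(prompt)


# Toy stand-ins so the sketch runs without any index or model.
corpus = [
    "Thad Jones was a jazz trumpeter.",
    "BM25 is a ranking function.",
    "Copenhagen is in Denmark.",
]
seen_prompts = []


def toy_retrieve(query, k):
    return corpus[:k]


def toy_rerank(query, passages):
    # Score = number of query words that occur as substrings of the passage.
    words = set(query.lower().split())
    return [sum(w in p.lower() for w in words) for p in passages]


def toy_generate(prompt):
    seen_prompts.append(prompt)  # record the prompt so it can be inspected
    return "A"                   # a real LLM call would go here


result = answer_with_rag("Who was Thad Jones?", toy_retrieve, toy_rerank,
                         toy_generate, k_context=1)
```

Passing the stages in as callables keeps the sketch independent of any particular retrieval or model library.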

System Modules

Retriever (Retrieval & Selection)

Retrieve relevant passages from MusWikiDB based on the input question

Model or implementation: BM25 (sparse retrieval)

Reranker (Retrieval & Selection)

Re-score retrieved passages to select the most relevant ones for generation

Model or implementation: bge-reranker-large

Generator

Generate the final answer using the retrieved context

Model or implementation: Various LLMs (e.g., Llama 3, Qwen3)
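
To make the first-stage retriever concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the authors' implementation: the tokenizer, the parameter values `k1=1.5` and `b=0.75` (common defaults), the smoothed IDF variant, and the three toy passages are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, passages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of passages that contain each term.
        self.df = Counter(term for d in self.docs for term in d)
        self.n = len(self.docs)

    def idf(self, term):
        # Smoothed IDF that stays non-negative even for very common terms.
        n_t = self.df.get(term, 0)
        return math.log((self.n - n_t + 0.5) / (n_t + 0.5) + 1.0)

    def score(self, query, idx):
        doc, length = self.docs[idx], self.lengths[idx]
        # Length normalisation: longer-than-average passages are penalised.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query, k=3):
        ranked = sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Toy passages, for illustration only.
passages = [
    "Thad Jones was a jazz trumpeter and composer.",
    "The sitar is a plucked string instrument.",
    "A trumpeter plays the trumpet.",
]
bm25 = BM25(passages)
best = bm25.top_k("jazz trumpeter", k=1)
```

In the described pipeline, the top-ranked passages from this stage would then be handed to the reranker before generation.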

Novel Architectural Elements

Construction of MusWikiDB: A domain-specific retrieval index built by crawling Wikipedia to depth-3 from music root pages, filtered for music relevance and segmented specifically for RAG.
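
The depth-limited crawl and passage segmentation described above can be sketched as follows. This is a toy model under stated assumptions: the link graph is an in-memory dictionary rather than live Wikipedia, the relevance filter is a hypothetical allow-list (`MUSIC_PAGES`), and the passage length `words_per_passage` is a placeholder, since the summary does not give the actual filtering rules or segment size.

```python
from collections import deque


def crawl(link_graph, roots, max_depth=3):
    """Breadth-first traversal returning {page: depth} for pages within max_depth of a root."""
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in link_graph.get(page, []):
            if linked not in depth:
                depth[linked] = depth[page] + 1
                queue.append(linked)
    return depth


def segment(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping passages of at most N words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]


# Toy link graph, for illustration only.
graph = {
    "Music": ["Jazz"],
    "Jazz": ["Thad Jones", "New Orleans"],
    "Thad Jones": ["Copenhagen"],
    "Copenhagen": ["Denmark"],
}
MUSIC_PAGES = {"Music", "Jazz", "Thad Jones"}  # hypothetical relevance filter

pages = crawl(graph, ["Music"])                       # "Denmark" is at depth 4 and is skipped
kept = sorted(p for p in pages if p in MUSIC_PAGES)   # apply the relevance filter
```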

Modeling

Base Model: Llama 3.1 8B Instruct (primary model for ablation)

Training Method: Supervised Fine-Tuning (SFT) using LoRA

Objective Functions:

Purpose: Standard Language Modeling.

Formally: Minimize negative log-likelihood of the target tokens given the input.
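
Written out, this is the usual token-level negative log-likelihood. The notation is introduced here for illustration and is not taken from the paper: x is the input prompt, y = (y_1, ..., y_T) is the target sequence, and θ denotes the trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Under this reading, the two fine-tuning variants differ only in x: for QA pairs it contains the question alone, while for RAG-style triples it contains the context followed by the question.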

Adaptation: LoRA (rank=16, alpha=16)

Trainable Parameters: LoRA adapters only
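
As a rough illustration of what "LoRA adapters only" means, the sketch below implements the standard LoRA forward pass, y = W x + (alpha / r) · B (A x), and counts the parameters one adapter adds. With rank 16 and alpha 16 the scaling factor is 1.0. The 4096 × 4096 layer size is an illustrative assumption, and the summary does not say which modules receive adapters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=16, rank=16):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained.

    `rank` is used only for the scaling factor here; the actual rank is the
    number of rows of A (equivalently, columns of B).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_extra_params(d_in, d_out, rank=16):
    """Parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative layer size (an assumption, not taken from the summary).
full = 4096 * 4096                      # 16,777,216 frozen weights
extra = lora_extra_params(4096, 4096)   # 131,072 trainable weights
fraction = extra / full                 # 0.0078125, i.e. under 1% of the layer

# Tiny rank-1 example; with B initialised to zeros the output equals W x.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [3, 4]
y_init = lora_forward(W, A, [[0], [0]], x, alpha=1, rank=1)
y_trained = lora_forward(W, A, [[0], [2]], x, alpha=1, rank=1)
```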

Training Data:

8K examples from MusWikiDB
Comparison between QA pairs (question, answer) and RAG-style triples (context, question, answer)
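
The difference between the two training formats can be shown with a pair of formatting helpers. The prompt templates, function names, and field names below are hypothetical; the summary does not reproduce the authors' actual templates.

```python
def format_qa(question, answer):
    """Standard QA fine-tuning example: the model sees only the question."""
    return {
        "prompt": f"Question: {question}\nAnswer:",
        "completion": f" {answer}",
    }


def format_rag(context, question, answer):
    """RAG-style example: a passage is placed in the prompt ahead of the question."""
    return {
        "prompt": f"Context: {context}\n\nQuestion: {question}\nAnswer:",
        "completion": f" {answer}",
    }


# Made-up example, for illustration only.
qa = format_qa("Which instrument did the artist play?", "A")
rag = format_rag("The artist was a jazz trumpeter.",
                 "Which instrument did the artist play?", "A")
```

Both variants share the same completion, so the loss is computed on identical target tokens; only the conditioning text changes.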

Key Hyperparameters:

learning_rate: 3e-5
batch_size: 2
gradient_accumulation_steps: 4
epochs: 1
dropout: 0.1
weight_decay: 0.005
scheduler: cosine
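
A short calculation shows what these settings imply. The sketch assumes that "8K examples" means exactly 8,000, that training runs on a single device, and that the cosine schedule decays to zero without warmup; none of these is stated in the summary. The dictionary keys are illustrative and not tied to a specific trainer API.

```python
import math

# Values listed above; key names are illustrative.
config = {
    "learning_rate": 3e-5,
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
    "epochs": 1,
    "dropout": 0.1,
    "weight_decay": 0.005,
    "scheduler": "cosine",
}

NUM_EXAMPLES = 8000  # assumption: "8K examples" read as exactly 8,000
NUM_DEVICES = 1      # assumption: the device count is not stated

# Examples consumed per optimizer step.
effective_batch = (config["batch_size"]
                   * config["gradient_accumulation_steps"]
                   * NUM_DEVICES)
total_steps = math.ceil(NUM_EXAMPLES / effective_batch) * config["epochs"]


def cosine_lr(step, base_lr=config["learning_rate"], total=total_steps):
    """Cosine decay from base_lr to 0 with no warmup (the absence of warmup is an assumption)."""
    progress = min(step, total) / total
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


halfway_lr = cosine_lr(total_steps // 2)
```

Under these assumptions the effective batch size is 8 and one epoch takes 1,000 optimizer steps, with the learning rate falling to half its initial value at the midpoint.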

Comparison to Prior Work

vs. ChatMusician: ChatMusician lacks retrieval capabilities and struggles with factual hallucinations; the ArtistMus system uses RAG to ground its answers.
vs. General Wikipedia RAG: MusWikiDB retrieval is 40% faster and yields higher accuracy (+6 pp) because it filters out non-music noise and provides dense domain coverage.
vs. TrustMus: ArtistMus focuses on artist metadata and global diversity (163 countries), whereas TrustMus is derived from the Grove Dictionary and focuses on Western-centric theory and history.

Limitations

Evaluation is limited to multiple-choice format, which may not fully reflect open-ended QA capabilities.
Reliance on Wikipedia as the primary knowledge source inherits Wikipedia's potential biases, despite efforts to balance global representation.
The fine-tuning experiments were conducted primarily on Llama 3.1 8B; generalization to other architectures is assumed but not exhaustively tested.

Reproducibility

Code: https://github.com/DaeyongKwon98/ArtistMus

Code and data are publicly available at the repository linked above. The MusWikiDB construction process is documented, including crawl depth and filtering rules, as is the validation of ArtistMus questions using GPT-4o and human verification.

📊 Experiments & Results

Evaluation Setup

Zero-shot vs. RAG vs. RAG+Rerank on multiple-choice questions.

Benchmarks:

ArtistMus (in-domain music QA, factual and contextual) [New]
TrustMus (Out-of-domain Music QA)

Metrics:

Accuracy (Exact Match of option A/B/C/D)
Statistical methodology: Not explicitly reported in the paper
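
Exact-match accuracy over option letters can be computed along these lines. The answer-extraction rule (take the first standalone A, B, C, or D in the model output) is an assumption, since the summary does not describe how model outputs are parsed.

```python
import re


def extract_option(output):
    """Return the first standalone A/B/C/D in the model output, or None if absent."""
    match = re.search(r"\b([ABCD])\b", output.strip())
    return match.group(1) if match else None


def accuracy(outputs, gold):
    """Percentage of outputs whose extracted option equals the gold label."""
    correct = sum(extract_option(o) == g for o, g in zip(outputs, gold))
    return 100.0 * correct / len(gold)


# Made-up model outputs, for illustration only.
outputs = ["A", "Answer: C.", "(D) Violin", "no idea"]
gold = ["A", "C", "B", "D"]
score = accuracy(outputs, gold)
```

A rule this simple would misread an output that begins with the English article "A", so a stricter output format or parser may be needed in practice.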

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
RAG significantly improves performance over Zero-shot baselines across various models on the ArtistMus benchmark, especially for factual questions.
ArtistMus (Factual)	Accuracy	35.0	91.8	+56.8
ArtistMus (Factual)	Accuracy	39.0	84.8	+45.8
ArtistMus (Contextual)	Accuracy	84.8	85.8	+1.0
Generalization to out-of-domain data (TrustMus) also shows improvements with the MusWikiDB retrieval system.
TrustMus	Accuracy	25.0	36.0	+11.0
Ablation on fine-tuning strategies shows that RAG-style fine-tuning (with context) is superior to standard QA fine-tuning.
ArtistMus (Contextual)	Accuracy	75.6	86.4	+10.8
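
The Δ column is a difference in percentage points, which is easy to confuse with relative improvement. The sketch below recomputes each Δ from the rows above and contrasts it with the relative change for the first row; the helper names are illustrative.

```python
# (benchmark, baseline %, result %, reported delta in pp), copied from the rows above.
rows = [
    ("ArtistMus (Factual)", 35.0, 91.8, 56.8),
    ("ArtistMus (Factual)", 39.0, 84.8, 45.8),
    ("ArtistMus (Contextual)", 84.8, 85.8, 1.0),
    ("TrustMus", 25.0, 36.0, 11.0),
    ("ArtistMus (Contextual)", 75.6, 86.4, 10.8),
]


def pp_delta(baseline, result):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(result - baseline, 1)


def relative_change(baseline, result):
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (result - baseline) / baseline, 1)


all_consistent = all(pp_delta(b, r) == d for _, b, r, d in rows)
first_relative = relative_change(35.0, 91.8)  # relative gain for the first row
```

For the first row, a gain of 56.8 pp corresponds to a relative improvement of roughly 162% over the baseline, which is why the two measures should not be used interchangeably.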

Experiment Figures

Regional distribution of artists in the ArtistMus benchmark.

Main Takeaways

Retrieval is critical for factual music QA: RAG boosts accuracy by more than 40 pp on factual questions, bridging the gap between open-source and proprietary models.
Domain-specific retrieval structure matters: MusWikiDB enables better performance than general Wikipedia dumps due to cleaner, music-focused indexing.
Factual vs. Contextual gap: RAG yields large gains on factual recall (up to +56.8 pp) but only marginal gains on contextual reasoning (+1.0 pp in the reported comparison), where the model's own reasoning ability, rather than missing knowledge, is the bottleneck.
RAG-style training is superior: Training with (context, question, answer) triples prevents the overfitting seen in standard QA fine-tuning and teaches the model to utilize retrieved context effectively.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) pipelines
Familiarity with vector databases and BM25 retrieval
Basic knowledge of LLM fine-tuning strategies (LoRA)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which the system first retrieves relevant documents and then generates an answer conditioned on them

BM25: A probabilistic retrieval function based on term frequency and inverse document frequency, used here for first-stage retrieval

MusWikiDB: The authors' proposed vector database containing 3.2M music-specific Wikipedia passages

ArtistMus: The authors' proposed benchmark dataset containing 1,000 questions about 500 globally diverse music artists

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

pp: Percentage points, the arithmetic difference between two percentages (for example, 35.0% → 91.8% is +56.8 pp)

Contextual reasoning: Questions requiring synthesis or inference across multiple pieces of information within a passage, rather than simple fact lookup

Reranker: A second-stage model that re-scores retrieved documents to improve the quality of the context provided to the generator