MAS: Multi-Agent Systems—systems where multiple AI agents collaborate (e.g., through debate or voting) to solve complex tasks
LVLM: Large Vision-Language Model—AI models capable of processing and reasoning over both text and visual inputs simultaneously
MDT: Multidisciplinary Team—a medical term referring to a group of doctors from different specialties collaborating on a diagnosis
VLM-SJ: VLM Semantic Judge—the proposed evaluation protocol, which uses a high-capacity VLM (Qwen2.5-VL-32B) to assess semantic correctness rather than string matching
Rule-EM: Rule-based Exact Match—a rigid metric requiring the model output to be character-for-character identical to the ground truth
Rule-MR: Rule-based Multi-Regex—a metric that uses regular expressions to extract answers and often fails on verbose agent outputs
Instruction-following fatigue: A phenomenon where agents in long interaction chains lose adherence to formatting constraints (e.g., 'answer with just A/B') while maintaining reasoning quality
Specialization Penalty: The observed performance drop when general-purpose MAS architectures are applied to highly specialized medical sub-domains
Zero-shot: Evaluating a model on tasks it has not been explicitly trained or fine-tuned for, relying on its pre-trained capabilities
Pareto frontier: The set of configurations that no other configuration beats on every objective at once; here, the trade-off between diagnostic accuracy and computational cost (measured in tokens)
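
As a rough illustration of the Pareto frontier entry, the sketch below keeps only those configurations that no other configuration dominates, where "dominates" means at least as accurate and at least as cheap, and strictly better on one of the two. The configuration names and numbers are hypothetical placeholders chosen for the example, not results from the source.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str        # hypothetical label for a system configuration
    accuracy: float  # diagnostic accuracy, higher is better
    tokens: int      # computational cost in tokens, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens
    strictly_better = a.accuracy > b.accuracy or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the non-dominated configurations, sorted by token cost."""
    frontier = [
        c for c in configs
        if not any(dominates(other, c) for other in configs)
    ]
    return sorted(frontier, key=lambda c: c.tokens)


# Hypothetical numbers, for illustration only.
configs = [
    Config("single_agent", 0.70, 1_000),
    Config("mdt_panel", 0.74, 5_000),
    Config("debate", 0.75, 8_000),
    Config("voting", 0.72, 9_000),
]

frontier = pareto_frontier(configs)
for c in frontier:
    print(f"{c.name}: accuracy={c.accuracy:.2f}, tokens={c.tokens}")
```

In this made-up data, `voting` falls off the frontier because `debate` is both more accurate and cheaper; the other three each represent a distinct accuracy-versus-cost trade-off.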