MAPS: A Multilingual Benchmark for Agent Performance and Security

📝 Paper Summary

Multilingual Agent Evaluation · Agent Security

MAPS extends four major agentic benchmarks into 11 languages to reveal that non-English inputs significantly degrade agent performance and amplify security vulnerabilities.

Core Problem

Existing agentic AI benchmarks are almost exclusively English-only, failing to capture how reliability and security degrade when agents process non-English inputs.

Why it matters:

Global accessibility requires agents to function reliably for non-English speakers, who risk encountering errors even when the underlying tools are English-based
Agents act on the world via tools; misunderstanding a non-English query can lead to incorrect financial transactions or dangerous code execution
Prior multilingual LLM benchmarks focus on text generation quality, missing the specific failures in multi-step agentic reasoning, planning, and tool use

Concrete Example: A non-English speaker using a banking agent might issue a query in Spanish; the agent, struggling with the translation or reasoning, might execute an incorrect fund transfer or fail to detect a security-critical instruction that it would have caught in English.

Key Novelty

MAPS (Multilingual Agentic AI Benchmark Suite)

Unified multilingual extension of four diverse agent benchmarks (GAIA, SWE-Bench, MATH, Agent Security Benchmark) into 11 typologically diverse languages
Hybrid translation pipeline combining Neural Machine Translation (NMT) for structure preservation with LLM-based polishing for fluency, verified by native speakers
Evaluation framework that keeps environments (tools, docs) in English but translates inputs, isolating the 'Multilingual Effect' on reasoning and safety

Architecture

The hybrid translation pipeline combining NMT and LLM with verification loops

Evaluation Highlights

Significant degradation in performance and security observed across all 11 non-English languages compared to English baselines
Translation quality verified with 94.2% answerability rate by native speakers, ensuring performance drops are due to agent reasoning failures, not bad translations
Security vulnerabilities increase in multilingual settings, with agents becoming more prone to unsafe behaviors when prompted in non-English languages

Breakthrough Assessment

8/10

First comprehensive, multi-domain benchmark specifically targeting multilingual agentic performance and security. Establishes a critical testing ground for future global agent deployment.

⚙️ Technical Details

Problem Definition

Setting: Multilingual evaluation of agentic systems where instructions are in language L_t but the environment/tools remain in English

Inputs: Task instruction s in target language L_t (e.g., German, Hindi)

Outputs: Agent action sequence and final answer t (checked against English ground truth)

Pipeline Flow

Group 1: Dataset Construction & Translation
Group 2: Agent Execution
Group 3: Evaluation

System Modules

Base Dataset Selector (Group 1: Dataset Construction & Translation)

Selects high-quality agentic tasks from GAIA, SWE-Bench, MATH, and ASB

Model or implementation: N/A

Hybrid Translator (Group 1: Dataset Construction & Translation)

Translates instructions into 11 languages using NMT + LLM refinement with fallback logic

Model or implementation: Google Translate (NMT) + Command A (LLM)
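
The "NMT + LLM refinement with fallback logic" flow can be illustrated with a minimal sketch. The summary does not spell out the actual integrity checks, so the rule used here (URLs, inline code, {placeholders} and numbers must survive unchanged) is an assumption, and `hybrid_translate`, `protected_tokens`, and the stub translators are hypothetical names rather than the authors' implementation. The stubs stand in for Google Translate and Command A so the example runs offline.

```python
import re
from typing import Callable, List, Tuple

# Assumed integrity rule: URLs, inline code, {placeholders} and numbers
# must appear in the translation exactly as they do in the source.
_PROTECTED = re.compile(r"https?://\S+|`[^`]*`|\{[^{}]*\}|\d+(?:\.\d+)?")


def protected_tokens(text: str) -> List[str]:
    """Return the protected tokens found in text, sorted for comparison."""
    return sorted(_PROTECTED.findall(text))


def hybrid_translate(
    source: str,
    nmt: Callable[[str], str],
    llm_refine: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Translate with NMT, polish with an LLM, fall back to NMT on failure.

    Returns (translation, origin) where origin is "llm" or "nmt_fallback".
    """
    draft = nmt(source)                  # structure-preserving first pass
    refined = llm_refine(source, draft)  # fluency pass
    if protected_tokens(refined) == protected_tokens(source):
        return refined, "llm"
    # The LLM altered or dropped a protected token: keep the NMT draft.
    return draft, "nmt_fallback"


# Stub translators (hypothetical) so the sketch runs without network access.
SOURCE = "Send 250 USD to {account} using https://bank.example/pay"


def fake_nmt(text: str) -> str:
    return "Sende 250 USD an {account} über https://bank.example/pay"


def good_llm(source: str, draft: str) -> str:
    return "Bitte sende 250 USD an {account} über https://bank.example/pay"


def bad_llm(source: str, draft: str) -> str:
    # Corrupts the amount and translates the placeholder name.
    return "Bitte sende 2500 USD an {Konto}"


ok_text, ok_origin = hybrid_translate(SOURCE, fake_nmt, good_llm)
fb_text, fb_origin = hybrid_translate(SOURCE, fake_nmt, bad_llm)
```

In this sketch the fallback only guards structural tokens; a fuller pipeline would also need an adequacy check on meaning, which is left out because the summary gives no detail on it.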

Native Verifier (Group 1: Dataset Construction & Translation)

Human experts verify translation quality (adequacy, fluency, formatting, answerability)

Model or implementation: Human Native Speakers
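
As a rough illustration of how the verifier judgments might be aggregated, the sketch below averages the 1-5 ratings and computes an answerability rate. The `Rating` record and `summarize` function are hypothetical, the 1-5 scale for fluency is assumed by analogy with the adequacy and formatting scales reported in the results, and the sample ratings are invented for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Rating:
    """One native speaker's judgment of one translated task (hypothetical)."""
    adequacy: int     # 1-5: is the source meaning preserved?
    fluency: int      # 1-5: does the text read naturally? (scale assumed)
    formatting: int   # 1-5: are code, numbers and layout intact?
    answerable: bool  # could an expert still solve the task?


def summarize(ratings: List[Rating]) -> Dict[str, float]:
    """Average the scale ratings and compute the answerability rate."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return {
        "adequacy": mean([r.adequacy for r in ratings]),
        "fluency": mean([r.fluency for r in ratings]),
        "formatting": mean([r.formatting for r in ratings]),
        "answerability_rate": sum(r.answerable for r in ratings) / len(ratings),
    }


# Invented ratings, for illustration only.
sample = [
    Rating(5, 4, 5, True),
    Rating(4, 4, 5, True),
    Rating(3, 2, 4, False),
    Rating(4, 2, 5, True),
]
summary = summarize(sample)
```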

Agent Runner (Group 2: Agent Execution)

Executes standard agents on the multilingual tasks

Model or implementation: Leading open-source agents (specific to each sub-benchmark)

Novel Architectural Elements

Hybrid translation pipeline with explicit automated integrity/adequacy checks and fallback mechanisms to balance structure (NMT) and fluency (LLM)
Verification protocol centered on 'answerability' rather than just linguistic fluency to ensure task solvability

Modeling

Base Model: Varies by benchmark component (uses existing agent frameworks)

Comparison to Prior Work

vs. XTREME/FLORES: MAPS evaluates full agentic loops (tools, planning, memory), not just static text generation
vs. X-WebAgentBench: MAPS covers 4 distinct domains (reasoning, coding, math, security) rather than just web navigation
vs. MASSIVE-Agents: MAPS evaluates end-to-end task completion, not just the function calling step
vs. WebMMU: MAPS includes a dedicated security evaluation (ASB extension) lacking in other multilingual agent benchmarks

Limitations

The environment (tools, docs) remains in English, so results may not carry over to fully localized deployments
Focuses on translation of instructions only, not cultural adaptation of tasks
Manual verification covers 25% of the dataset (a sample large enough to report confidence intervals, but not exhaustive)
Dependent on the quality of base agents; if base agents fail in English, multilingual degradation is harder to measure

Reproducibility

Dataset: https://huggingface.co/datasets/Fujitsu-FRE/MAPS

📊 Experiments & Results

Evaluation Setup

Agents attempt tasks with instructions in one of 11 non-English languages (plus English baseline). Tools and environment remain in English.
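
One way to quantify the gap between the English baseline and each target language is a per-language success rate and its difference from English. The sketch below assumes run results arrive as (language, succeeded) pairs; the function names and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each language code to its fraction of successful task runs."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for lang, succeeded in results:
        totals[lang] += 1
        wins[lang] += int(succeeded)
    return {lang: wins[lang] / totals[lang] for lang in totals}


def multilingual_effect(
    rates: Dict[str, float], baseline: str = "en"
) -> Dict[str, float]:
    """Success-rate difference of each language relative to the baseline.

    Negative values mean the agent did worse than with English instructions.
    """
    base = rates[baseline]
    return {lang: r - base for lang, r in rates.items() if lang != baseline}


# Toy outcomes, for illustration only.
runs = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("de", True), ("de", True), ("de", False), ("de", False),
    ("hi", True), ("hi", False), ("hi", False), ("hi", False),
]
rates = success_rates(runs)
effect = multilingual_effect(rates)
```

For the security benchmark the same difference can be taken over violation rates instead, in which case a positive value indicates the worse outcome.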

Benchmarks:

MAPS-GAIA (Real-world reasoning and tool use) [New]
MAPS-SWE-Bench (Software engineering / Code generation) [New]
MAPS-MATH (Mathematical reasoning) [New]
MAPS-ASB (Agent security / adversarial robustness) [New]

Metrics:

Success Rate / Accuracy
Security Violation Rate (for ASB)
Answerability (Translation Quality)
Statistical methodology: 95% confidence intervals calculated for human verification stats
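
The summary does not say which interval method was used for the human verification statistics. As one common choice for a proportion such as the answerability rate, the sketch below computes a Wilson score interval; the method, the function name, and the sample size of 1000 are all assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical sample: 942 of 1000 verified tasks judged answerable.
low, high = wilson_interval(942, 1000)
# A smaller sample with the same proportion gives a wider interval.
low_small, high_small = wilson_interval(471, 500)
```

The Wilson interval is used here because it stays within [0, 1] for proportions near the boundary, which the simpler normal approximation does not guarantee.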

Key Results

Human verification confirms the translation pipeline produces highly solvable tasks across languages.
Benchmark	Metric	Baseline	This Paper	Δ
MAPS (All)	Answerability Rate (%)	100.0	94.2	-5.8
MAPS (All)	Adequacy (1-5)	5.0	4.43	-0.57
MAPS (All)	Formatting Accuracy (1-5)	5.0	4.75	-0.25

Experiment Figures

Overview of the MAPS benchmark composition, showing the 4 source benchmarks and 11 target languages mapping to 805 tasks.

Main Takeaways

Agents consistently show performance degradation when instructions are translated from English, with severity correlating with the amount of translated input.
Security vulnerabilities are amplified in multilingual settings; agents are more likely to execute unsafe commands when prompted in non-English languages.
The 'Multilingual Effect' varies by task type, suggesting that tasks requiring precise technical definitions (like coding) might be more sensitive to translation nuances.
Answerability is a better predictor of agent success than pure linguistic fluency, highlighting the need for task-centric translation metrics.

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (planning, tool use)
Familiarity with standard agent benchmarks (GAIA, SWE-Bench, MATH)
Basic knowledge of machine translation concepts (NMT, adequacy, fluency)

Key Terms

NMT: Neural Machine Translation—automated translation using deep neural networks, known for preserving sentence structure but sometimes lacking nuance

GAIA: A benchmark for General AI Assistants that tests reasoning, tool use, and multi-modality in real-world scenarios

SWE-Bench: Software Engineering Benchmark—evaluates an agent's ability to resolve GitHub issues via code generation and editing

ASB: Agent Security Benchmark—evaluates agent robustness against adversarial attacks and safety violations

answerability: A custom metric measuring whether a translated task preserves enough meaning for a human expert to solve it correctly

Multilingual Effect: The measurable degradation in AI performance or safety when processing non-English inputs compared to English

adequacy: A translation quality metric assessing whether the meaning of the source text is fully preserved in the target

fluency: A translation quality metric assessing the grammatical and stylistic naturalness of the translated text