PING: Prefix INjection Guard, the proposed method, which prepends optimized natural-language tokens to an agent's responses to induce refusal of harmful tasks
Agentic Fine-Tuning: The process of fine-tuning a general LLM on datasets of agent interactions (e.g., tool use, web navigation) to improve task performance
Refusal Rate: The percentage of harmful instructions that the model correctly declines to execute
Attack Success Rate: The percentage of harmful instructions that the model successfully executes (a failure of safety)
Linear Probe: A simple classifier (usually logistic regression) trained on the internal activations of a neural network to distinguish between classes (here, refusal vs. compliance)
Activation Steering: A technique to modify model behavior by adding a specific vector (derived from linear probes) to the model's internal activations during inference
WebArena-Lite: A benchmark for evaluating web navigation agents on benign tasks
MINT-ALFWorld: A benchmark for evaluating code generation agents on benign tasks
RedCode-Exec: A safety benchmark for code agents containing harmful instructions
WebDojo: A newly introduced safety benchmark for web navigation agents containing harmful instructions
Success Rate: The proportion of benign tasks completed successfully by the agent
LLM: Large Language Model, a deep learning model trained on large text corpora to generate human-like text