A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

📝 Paper Summary

Open-domain Question Answering | Ambiguity Handling in QA | Reinforcement Learning for Search

A2SEARCH uses an automated data pipeline to find valid alternative answers for ambiguous questions, then trains a search agent via reinforcement learning to retrieve multiple correct answers simultaneously.

Core Problem

Standard QA benchmarks assume a single correct answer, penalizing models that find valid alternative answers to ambiguous questions, which leads to misleading reward signals during RL training.

Why it matters:

Real-world questions often have multiple valid answers depending on interpretation or reasoning paths
27.6% of questions in the MuSiQue benchmark admit multiple valid answers, but existing evaluations mark every answer other than the annotated one as incorrect
Current RL pipelines reward only the single annotated 'gold' answer, systematically understating true model capabilities and discouraging thorough search

Concrete Example: For the question 'Who is the owner of the record label of the performer of What Kind of Love?', the benchmark lists only 'Warner Music Group'. However, the performer (Rodney Crowell) released works under multiple labels, making 'Sony Music Entertainment' (parent of Columbia Records) an equally valid answer that standard models are penalized for generating.

Key Novelty

Annotation-Free Ambiguity-Aware RL Training

Automated Data Pipeline: Detects ambiguity by generating diverse answer trajectories from multiple models and verifying them against evidence using LLM judges, without manual annotation
AnsF1 Reward: Replaces binary correct/incorrect rewards with an Answer-level F1 score that rewards coverage of multiple valid answers (recall) while penalizing hallucination (precision)
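As a rough illustration of the AnsF1 idea described above, the following sketch scores a predicted answer set against groups of aliases, one group per valid answer. The function names, the normalization rules, and the alias-group representation are assumptions made for this example, not the authors' implementation.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def ans_f1(predicted: list[str], reference_groups: list[set[str]]) -> float:
    """Answer-level F1 of a predicted answer set (hypothetical helper).

    Each element of reference_groups is a set of aliases for ONE valid answer.
    """
    groups = [{normalize(a) for a in group} for group in reference_groups]
    preds = {normalize(p) for p in predicted}
    preds.discard("")
    if not preds or not groups:
        return 0.0
    # Precision: share of predictions that match any valid answer.
    valid = sum(1 for p in preds if any(p in g for g in groups))
    # Recall: share of valid answers hit by at least one prediction.
    matched = sum(1 for g in groups if preds & g)
    if valid == 0 or matched == 0:
        return 0.0
    precision = valid / len(preds)
    recall = matched / len(groups)
    return 2 * precision * recall / (precision + recall)
```

With the two labels from the concrete example as references, predicting only "Warner Music Group" has full precision but half recall, while adding an unsupported answer lowers precision instead, so both partial coverage and over-generation are scored below a complete, clean answer set.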

Architecture

The automated data construction pipeline for identifying alternative answers.

Evaluation Highlights

Achieves 48.4% AnsF1@1 on average across four multi-hop benchmarks with A2SEARCH-7B (single rollout), outperforming the much larger ReSearch-32B (46.2%)
Surpasses specialized baselines on AmbigQA (a human-annotated ambiguity benchmark) despite not being trained on it, showing robust generalization
Identifies alternative answers for 19.0% of training questions automatically, extending the effective supervision signal beyond single gold references

Breakthrough Assessment

8/10

Significant advance in handling ambiguity without human labeling. Demonstrates that acknowledging ambiguity improves general QA performance, outperforming larger models.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering where a question q may have a set of valid reference answers A = {ans*} ∪ A_alt, with ans* the annotated reference answer and A_alt the set of verified alternative answers

Inputs: Natural language question q

Outputs: Predicted set of answers Ans_predict (can contain multiple distinct answers)

Pipeline Flow

Pipeline: Question Input → Agent Policy (LLM) → [Loop: Reasoning → Tool Call → Search Engine → Tool Response] → Final Answer Generation
Training Data Generation: Sampling → Filtering → Verification → Grouping
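The inference loop in the first flow above can be sketched as follows. This is a toy illustration only: the `policy` and `search` callables, the dictionary fields, and the turn limit are hypothetical stand-ins for the LLM policy and the retrieval tool, and the stubs return canned values rather than real model or search output.

```python
from typing import Callable


def run_agent(
    question: str,
    policy: Callable[[str], dict],
    search: Callable[[str], list[str]],
    max_turns: int = 4,
) -> list[str]:
    """Alternate reasoning and tool calls until the policy emits final answers."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = policy(context)  # reasoning plus either a tool call or an answer
        context += f"Thought: {step['thought']}\n"
        if step["action"] == "answer":
            return step["answers"]  # may contain several distinct answers
        passages = search(step["query"])  # tool call and tool response
        context += f"Search: {step['query']}\n"
        context += "".join(f"Result: {p}\n" for p in passages)
    return []  # turn budget exhausted without a final answer


def toy_search(query: str) -> list[str]:
    """Stub retriever; a real system would query a passage index."""
    return [f"passage about {query}"]


def toy_policy(context: str) -> dict:
    """Stub policy: search once, then answer with two candidates."""
    if "Result:" not in context:
        return {"thought": "need evidence", "action": "search",
                "query": "What Kind of Love performer"}
    return {"thought": "enough evidence", "action": "answer",
            "answers": ["Warner Music Group", "Sony Music Entertainment"]}
```

Calling `run_agent("...", toy_policy, toy_search)` performs one search turn and then returns both answers in a single rollout, which is the behavior the ambiguity-aware training is meant to encourage.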

System Modules

Agent Policy

Generates reasoning steps, issues search queries, and synthesizes final answers

Model or implementation: Qwen2.5-7B/32B (Base or Instruct)

Search Tool

Retrieves relevant passages from Wikipedia based on queries

Model or implementation: Contriever / E5 embedding index (FAISS)

Verifier Ensemble

Verifies if generated alternative answers are supported by retrieved evidence

Model or implementation: Ensemble of Claude 3.5/3.7, OpenAI o3/o4-mini

Novel Architectural Elements

Offline pipeline utilizing trajectory sampling and multi-model verification to expand single-answer datasets into multi-answer training data

Modeling

Base Model: Qwen2.5-32B-Instruct (for data filtering), Qwen2.5-7B/3B (for training)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize the policy to maximize expected reward relative to the group baseline.

Formally: Maximize the average clipped surrogate, i.e. the policy probability ratio multiplied by the group-relative advantage, with the ratio clipped to a fixed range

Purpose: (Optional, for Base models) Prevent entropy collapse.

Formally: Add an entropy regularization term λ * H(policy)
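The pieces of these objectives can be illustrated with a small pure-Python sketch of the standard GRPO-style computation. The standard-deviation normalization, the clip range of 0.2, and all function names are assumptions for illustration; the summary does not state the paper's exact normalization or clipping settings.

```python
import math


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to the other rollouts for the same question."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped term for one token; clip_eps is an assumed value."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def entropy(probs: list[float]) -> float:
    """Shannon entropy H of a next-token distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Because the baseline is the group mean, a rollout only receives a positive advantage if its reward (for example, its AnsF1) exceeds that of its sibling rollouts; when all rollouts score the same, every advantage is zero and the question contributes no gradient. The entropy term, scaled by λ, would be added to the objective to keep the policy from becoming deterministic too early.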

Adaptation: Full model update (RL on top of SFT/Base)

Trainable Parameters: All parameters of the policy model

Training Data:

Source: MuSiQue, 2Wiki, NQ (Total ~50k questions)
Process: Sample 16 trajectories/question from 5 models → Filter redundant → Verify evidence → Group synonyms
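The filter, verify, and group stages of this process might look roughly like the sketch below. Sampling is omitted, and the voting threshold, the `Verifier` signature, and the `same` alias predicate are hypothetical: the summary says only that an ensemble of LLM judges checks candidates against evidence, not how their verdicts are combined.

```python
from typing import Callable

# A verifier takes (question, candidate answer, evidence passages) and
# returns True if it judges the answer supported. Hypothetical signature.
Verifier = Callable[[str, str, list[str]], bool]


def find_alternatives(
    question: str,
    gold: str,
    trajectories: list[tuple[str, list[str]]],
    verifiers: list[Verifier],
    min_votes: int,
) -> list[str]:
    """Return candidate answers that differ from gold and pass verification."""
    # Filter: drop the gold answer, empty answers, and repeated candidates.
    seen = {gold.strip().lower()}
    candidates = []
    for answer, evidence in trajectories:
        key = answer.strip().lower()
        if key and key not in seen:
            seen.add(key)
            candidates.append((answer, evidence))
    # Verify: keep candidates accepted by at least min_votes verifiers.
    return [
        answer
        for answer, evidence in candidates
        if sum(v(question, answer, evidence) for v in verifiers) >= min_votes
    ]


def group_aliases(answers: list[str], same: Callable[[str, str], bool]) -> list[list[str]]:
    """Group answers that the `same` predicate treats as aliases of one entity."""
    groups: list[list[str]] = []
    for answer in answers:
        for group in groups:
            if same(answer, group[0]):
                group.append(answer)
                break
        else:
            groups.append([answer])
    return groups
```

In practice each verifier and the alias predicate would wrap an LLM call; plain functions are used here so the control flow can be run and inspected without any external service.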

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 256
max_context_length: 8192
epochs: 4
rollout_size: 16
reward_alpha: 0.4

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: A2SEARCH rewards multiple valid answers via AnsF1 instead of binary reward for single match
vs. AmbigQA baselines: A2SEARCH discovers ambiguity automatically from standard datasets rather than relying on human annotations
vs. ReSearch: A2SEARCH resolves ambiguity in a single rollout by retrieving multiple answers, whereas ReSearch often requires multiple sampled rollouts to cover diverse answers

Limitations

Relies on proprietary LLMs (Claude, OpenAI) for the high-quality verification step in data construction
Entropy control required for base models to prevent collapse, adding hyperparameter complexity
Evaluation still largely depends on Exact Match, which may miss semantically correct but lexically distinct answers despite efforts to group aliases

Reproducibility

Code: https://github.com/zfj1998/A2Search

Code, data, and model weights available at https://github.com/zfj1998/A2Search. Uses proprietary models (Claude/OpenAI) for data construction verification steps.

📊 Experiments & Results

Evaluation Setup

Open-domain QA on 8 benchmarks, evaluated using Exact Match and LMJudge

Benchmarks:

MuSiQue (Multi-hop QA)
HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
AmbigQA (Ambiguous QA)
NQ (Open-domain QA)
TriviaQA (Open-domain QA)
PopQA (Open-domain QA)
Bamboogle (Multi-hop QA)

Metrics:

AnsF1@1 (Answer-level F1 score with 1 rollout)
Recall@1 (Recall of reference answers with 1 rollout)
AnsF1@3 (Expected F1 with 3 rollouts)
Statistical methodology: Not explicitly reported in the paper
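For reference, the answer-level F1 underlying the AnsF1 metrics can be written out as below, using the reference set A and the predicted set Ans_predict from the problem definition. This is a reconstruction from the verbal definition in this summary (precision over predicted answers, recall over reference answers), so the exact matching rule used in the paper may differ.

```latex
\[
P = \frac{\bigl|\{\, a \in \mathrm{Ans}_{\mathrm{predict}} : a \text{ matches some answer in } A \,\}\bigr|}
         {\bigl|\mathrm{Ans}_{\mathrm{predict}}\bigr|},
\qquad
R = \frac{\bigl|\{\, a^{\ast} \in A : a^{\ast} \text{ is matched by some prediction} \,\}\bigr|}
         {|A|},
\qquad
\mathrm{AnsF1} = \frac{2PR}{P + R}
\]
```

As a worked case, if A contains two valid answers and the model predicts only one of them, then P = 1 and R = 1/2, giving AnsF1 = 2/3; AnsF1 is taken to be 0 when no prediction matches.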

Key Results

A2SEARCH significantly outperforms baselines on multi-hop benchmarks using a single rollout, often surpassing larger models.
Benchmark	Metric	Baseline	This Paper	Δ
4 Multi-hop Benchmarks (Avg)	AnsF1@1	46.2	48.4	+2.2
4 Multi-hop Benchmarks (Avg)	AnsF1@1	39.3	48.4	+9.1
MuSiQue	AnsF1@1	53.0	62.3	+9.3
AmbigQA	AnsF1@1	47.6	51.4	+3.8
HotpotQA	AnsF1@1	45.6	49.5	+3.9

Experiment Figures

Training dynamics (AnsF1, Recall, Entropy) over epochs for A2SEARCH.

Answer count distribution in the constructed dataset.

Main Takeaways

Ambiguity-aware training allows the model to output multiple valid answers in a single rollout, achieving higher recall more efficiently than sampling multiple rollouts from standard models.
The automated pipeline successfully identifies alternative answers in 19% of training questions, providing a richer signal than standard single-answer supervision.
The method generalizes to base models (not just instruct-tuned ones) when combined with entropy control, preventing early collapse.
Performance gains are most pronounced on datasets with high inherent ambiguity like MuSiQue (27.6% ambiguous questions) compared to cleaner datasets like 2Wiki.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically GRPO or PPO)
Retrieval-Augmented Generation (RAG)
Knowledge of QA metrics (Exact Match, F1)
LLM-based evaluation (LLM-as-a-Judge)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group averages of rollouts rather than a separate critic model

AnsF1: Answer-level F1 score—a metric calculating the harmonic mean of precision (valid answers / total predicted) and recall (matched references / total references)

rollout: A complete sequence of actions (reasoning, tool calls, answer generation) generated by the model during RL training

multi-hop QA: Questions requiring reasoning across multiple documents or steps to answer

agentic search: A setup where the model actively issues search queries and processes results in a multi-turn loop

trajectory sampling: Generating multiple different solution paths (trajectories) for a single question to explore possible answers

reference answer: The original 'gold' answer provided in the benchmark dataset

alternative answer: A valid answer different from the reference, discovered via the pipeline and verified by evidence

Exact Match: Evaluation metric checking if the predicted string exactly matches a ground truth string (after normalization)

LMJudge: Using a Large Language Model to evaluate if a predicted answer is semantically equivalent to the ground truth

entropy collapse: A failure mode in RL where the policy becomes deterministic too early, stopping exploration

tool-call: A specific action token sequence that triggers an external search engine