ASMR: Aggregated Semantic Matching Retrieval Unleashing Commonsense Ability of LLM through Open-Ended Question Answering

📝 Paper Summary

Commonsense Reasoning, Multiple-Choice Question Answering

ASMR improves multiple-choice commonsense reasoning by first prompting an LLM to generate open-ended answers, then using those answers to semantically retrieve the most relevant choices before final prediction.

Core Problem

Providing all answer choices directly to an LLM in multiple-choice tasks restricts the model's potential to access its internal commonsense knowledge, often leading to distraction by plausible incorrect options.

Why it matters:

LLMs frequently suffer from biases like majority label bias and recency bias when presented with multiple choices
Direct multiple-choice prompting limits the model to comparing given options rather than first formulating its own understanding, unlike human reasoning processes

Concrete Example: Question: 'Why are dogs often known as man’s best friend?' Choices: A. aggressive, B. friendly, C. very loyal... With standard Multiple Choice Prompting (MCP), the LLM incorrectly picks 'B. friendly' because it seems plausible. ASMR first generates the open-ended answer 'loyal', which is semantically closest to choice C, so the model goes on to select 'C. very loyal' correctly.

Key Novelty

Aggregated Semantic Matching Retrieval (ASMR)

Mimic human reasoning by generating a preliminary open-ended answer before seeing the choices
Use the semantic similarity between the generated open-ended answer and the provided options to filter/retrieve the most relevant choices
Re-prompt the LLM with only the top-k semantically relevant choices to make the final decision

Architecture

The proposed ASMR framework workflow compared to standard approaches

Evaluation Highlights

+15.3 accuracy points (45.0% to 60.3%) over the previous SOTA (MCP) on the SIQA dataset using ASMR-C (Concatenation) with top-3 retrieval
Consistently outperforms Zero-Shot Self-Consistency (ZS-SC) and Multiple Choice Prompting (MCP) across CSQA, SIQA, and ARC datasets
ASMR-C achieves 60.9% accuracy on CSQA (vs 51.2% baseline) and 72.6% on ARC-Easy (vs 58.9% baseline)

Breakthrough Assessment

7/10

Significant performance gains (+15.3 accuracy points on SIQA) from a simple, model-agnostic prompting strategy that aligns well with human intuition. However, it relies on existing components (SimCSE) rather than a new architecture.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice commonsense reasoning task

Inputs: A question x and a set of answer choices Y = {y1, y2, ..., yn}

Outputs: The selected answer choice y_hat from Y

Pipeline Flow

Step 1: Open-Ended Answer Generation (Greedy, Beam, Sampling)
Step 2: Semantic Matching (calculate similarity between generated answers and choices)
Step 3: Relevant Answer Retrieval (Select Top-k choices)
Step 4: Final Answer Extraction (Re-prompt with Top-k choices)
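
The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: the callables `generate_answers`, `score_choice`, and `predict` are hypothetical stand-ins for the LLM and the embedding model, and the toy versions below exist only so the sketch runs without any model.

```python
from typing import Callable, List, Sequence


def asmr_pipeline(
    question: str,
    choices: Sequence[str],
    generate_answers: Callable[[str], List[str]],
    score_choice: Callable[[List[str], str], float],
    predict: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Run the four steps in order; every callable is a hypothetical stand-in."""
    # Step 1: open-ended answers, produced without showing the choices.
    answers = generate_answers(question)
    # Step 2: one similarity score per choice.
    scores = [score_choice(answers, choice) for choice in choices]
    # Step 3: keep the k highest-scoring choices (ties keep original order).
    ranked = sorted(zip(scores, choices), key=lambda pair: pair[0], reverse=True)
    retrieved = [choice for _, choice in ranked[:top_k]]
    # Step 4: final decision over the reduced choice set.
    return predict(question, retrieved)


# Toy stand-ins (assumptions for illustration only).
def toy_generate(question: str) -> List[str]:
    return ["loyal", "they are loyal companions"]


def toy_score(answers: List[str], choice: str) -> float:
    # Count how many words of the choice appear in the generated answers.
    words = set(" ".join(answers).lower().split())
    return float(sum(w in words for w in choice.lower().split()))


def toy_predict(question: str, retrieved: List[str]) -> str:
    # A real predictor would re-prompt the LLM; here we take the top choice.
    return retrieved[0]


result = asmr_pipeline(
    "Why are dogs often known as man's best friend?",
    ["aggressive", "friendly", "very loyal"],
    toy_generate,
    toy_score,
    toy_predict,
    top_k=2,
)
print(result)
```

Passing the three stages in as callables keeps the sketch independent of any particular model, which mirrors the summary's description of the method as model-agnostic.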

System Modules

Generator

Generate preliminary open-ended answers to the question without seeing choices

Model or implementation: Llama-2-7b-chat-hf (also tested with Mistral-7B-Instruct-v0.1)
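
As a rough sketch of how the three decoding strategies might be configured, the dictionary below lists keyword arguments for the Hugging Face `generate` method. The parameter names are standard `generate` options, but every value (beam count, temperature, `top_p`, number of returned sequences, token budget) is an assumption chosen for illustration and is not taken from the paper.

```python
# Keyword arguments for Hugging Face `model.generate(...)`, one set per
# decoding strategy. Values are illustrative assumptions, not the paper's
# settings.
DECODING_CONFIGS = {
    # Greedy: always take the single most probable next token.
    "greedy": {"do_sample": False, "num_beams": 1, "max_new_tokens": 32},
    # Beam: keep several hypotheses and return all of them.
    "beam": {
        "do_sample": False,
        "num_beams": 5,
        "num_return_sequences": 5,
        "max_new_tokens": 32,
    },
    # Sampling: draw tokens stochastically from the softened distribution.
    "sampling": {
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
        "num_return_sequences": 5,
        "max_new_tokens": 32,
    },
}


def total_candidates(configs: dict) -> int:
    """Number of open-ended answers produced per question across strategies."""
    return sum(cfg.get("num_return_sequences", 1) for cfg in configs.values())


print(total_candidates(DECODING_CONFIGS))
```

Each entry would be unpacked into a call such as `model.generate(**inputs, **DECODING_CONFIGS["beam"])`, and the decoded outputs from all three strategies pooled before matching.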

Matcher

Compute semantic similarity between generated answers and provided choices

Model or implementation: SimCSE (sup-simcse-bert-base-uncased)
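
The matching step reduces to comparing embedding vectors. The sketch below assumes cosine similarity as the comparison (the metric listed under Key Terms) and uses made-up 3-dimensional vectors in place of real SimCSE embeddings, which would come from the encoder named above.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; a sentence encoder would supply the real ones.
answer_vec = [0.9, 0.1, 0.0]  # generated answer "loyal"
choice_vecs = {
    "friendly": [0.5, 0.5, 0.0],
    "very loyal": [0.8, 0.2, 0.1],
}

scores = {choice: cosine_similarity(answer_vec, vec) for choice, vec in choice_vecs.items()}
best = max(scores, key=scores.get)
print(best)
```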

Predictor

Select the final answer from the filtered choices

Model or implementation: Llama-2-7b-chat-hf
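
To illustrate what the predictor receives, here is one possible way to build the restricted prompt from the retrieved choices. The template wording and the helper name `build_restricted_prompt` are hypothetical; the summary does not specify the exact prompt text.

```python
import string
from typing import List


def build_restricted_prompt(question: str, retrieved: List[str]) -> str:
    """Format a multiple-choice prompt that lists only the retrieved choices.

    Choices are relabelled A, B, C, ... in retrieval order, so the caller
    must map the predicted letter back to the original choice text.
    """
    lines = [f"Question: {question}"]
    for label, choice in zip(string.ascii_uppercase, retrieved):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_restricted_prompt(
    "Why are dogs often known as man's best friend?",
    ["very loyal", "friendly"],
)
print(prompt)
```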

Novel Architectural Elements

Two-stage prompting pipeline: Open-ended generation -> Semantic retrieval -> Restricted choice prompting
Aggregation of generated answers via Concatenation (ASMR-C) or Score Summation (ASMR-A) to robustly match choices
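
The difference between the two aggregation variants can be sketched as follows, under the assumption that ASMR-C scores each choice once against the concatenated answers while ASMR-A scores each answer separately and sums the results. The similarity function here is a simple token-overlap (Jaccard) measure used as a stand-in so the example runs without an embedding model; it is not the SimCSE matcher.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: shared tokens divided by all distinct tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_concat(answers: List[str], choices: List[str], sim: Similarity) -> Dict[str, float]:
    """Concatenation variant: join all answers, then score each choice once."""
    joined = " ".join(answers)
    return {choice: sim(joined, choice) for choice in choices}


def score_sum(answers: List[str], choices: List[str], sim: Similarity) -> Dict[str, float]:
    """Summation variant: score each answer separately and add the scores."""
    return {choice: sum(sim(ans, choice) for ans in answers) for choice in choices}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k choices with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


answers = ["loyal", "very loyal and friendly", "loyal companions"]
choices = ["aggressive", "friendly", "very loyal"]

print(top_k(score_concat(answers, choices, token_jaccard), k=2))
print(top_k(score_sum(answers, choices, token_jaccard), k=2))
```

In this toy case both variants retrieve the same two choices; with real embeddings they can disagree, since concatenation yields one pooled representation while summation lets each generated answer vote independently.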

Modeling

Base Model: Llama-2-7b-chat-hf (primary), Mistral-7B-Instruct-v0.1 (verification)

Compute: Experiments run on RTX 3090 GPU with Intel Core i9-13900K CPU

Comparison to Prior Work

vs. MCP: ASMR filters choices based on the model's internal prior knowledge (open-ended generation) before asking it to choose
vs. ZS-SC: ASMR focuses on retrieval of choices via semantic matching rather than just self-consistency of the final answer selection
vs. CP (Cloze Prompting): ASMR uses generative prompting followed by retrieval, avoiding calibration issues of probability scoring [not cited in paper]

Limitations

Requires an external embedding model (SimCSE) for semantic matching
Inference cost is higher than simple MCP due to multiple generation passes (Step 1) and similarity calculation
Only evaluated on Llama-2-7b and Mistral-7B; larger models not tested
Performance depends on the quality of the initial open-ended generation; if the model hallucinates wildly, retrieval may fail

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on validation sets of commonsense reasoning benchmarks

Benchmarks:

CommonsenseQA (CSQA) (5-way multiple choice commonsense reasoning)
SocialIQA (SIQA) (3-way multiple choice social reasoning)
AI2 Reasoning Challenge (ARC-Easy & ARC-Challenge) (4-way multiple choice science reasoning)

Metrics:

Accuracy
Top-k Accuracy (for intermediate retrieval step)
Statistical methodology: Not explicitly reported in the paper
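
Top-k accuracy for the retrieval step can be read as the fraction of questions whose correct choice survives into the top-k retrieved set. The sketch below assumes that definition; the rankings and gold labels are invented for illustration.

```python
from typing import List


def top_k_accuracy(ranked_choices: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of questions whose gold choice is among the first k ranked choices."""
    if len(ranked_choices) != len(gold):
        raise ValueError("need one ranking per gold label")
    if not gold:
        raise ValueError("need at least one example")
    hits = sum(g in ranking[:k] for ranking, g in zip(ranked_choices, gold))
    return hits / len(gold)


# Invented rankings (best first) and gold labels for four questions.
ranked = [
    ["C", "B", "A"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["B", "C", "A"],
]
gold = ["C", "C", "C", "A"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, gold, k))
```

Top-k accuracy bounds the final accuracy from above: if the correct choice is filtered out during retrieval, the predictor cannot recover it.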

Key Results

Main results comparing ASMR variants against baselines (MCP and ZS-SC) using Llama-2-7b.

Benchmark	Metric	Baseline	This Paper	Δ
SIQA	Accuracy (%)	45.0	60.3	+15.3
CSQA	Accuracy (%)	59.2	61.3	+2.1
ARC-Easy	Accuracy (%)	72.5	72.6	+0.1
ARC-Challenge	Accuracy (%)	51.5	53.8	+2.3

Model generalization results verifying ASMR effectiveness on Mistral-7B.

Benchmark	Metric	Baseline	This Paper	Δ
CSQA	Accuracy (%)	51.5	64.8	+13.3
ARC-Easy	Accuracy (%)	74.7	78.8	+4.1
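
The Δ column reports absolute differences in accuracy points rather than relative improvements. The short check below recomputes both from the table values, which makes the distinction concrete; the helper name `gains` is introduced here for illustration only.

```python
from typing import List, Tuple

# (model, benchmark, baseline accuracy, ASMR accuracy), copied from the table.
ROWS: List[Tuple[str, str, float, float]] = [
    ("Llama-2-7b", "SIQA", 45.0, 60.3),
    ("Llama-2-7b", "CSQA", 59.2, 61.3),
    ("Llama-2-7b", "ARC-Easy", 72.5, 72.6),
    ("Llama-2-7b", "ARC-Challenge", 51.5, 53.8),
    ("Mistral-7B", "CSQA", 51.5, 64.8),
    ("Mistral-7B", "ARC-Easy", 74.7, 78.8),
]


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


for model, bench, base, ours in ROWS:
    absolute, relative = gains(base, ours)
    print(f"{model:11s} {bench:14s} +{absolute} points ({relative}% relative)")
```

For SIQA, for instance, the 15.3-point absolute gain corresponds to a 34.0% relative improvement over the 45.0% baseline.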

Main Takeaways

ASMR outperforms standard Multiple Choice Prompting (MCP) on every dataset; the reported gains over the listed baselines range from +0.1 accuracy points (ARC-Easy) to +15.3 (SIQA) with Llama-2-7b
Combining responses from multiple decoding strategies (Greedy + Beam + Sampling) works better than any single strategy alone
Retrieving Top-3 choices for the final prompt generally yields better accuracy than selecting just the Top-1 or Top->Avg choices (except for CSQA where Top-1/Top-3 are close)
The method is model-agnostic, showing strong improvements on both Llama 2 and Mistral

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting strategies (Zero-shot, Few-shot)
Familiarity with semantic similarity and embeddings
Knowledge of decoding strategies (Greedy search, Beam search, Temperature sampling)

Key Terms

ASMR: Aggregated Semantic Matching Retrieval—the proposed method that generates open-ended answers first to retrieve relevant choices

MCP: Multiple Choice Prompting—a baseline method where the question and all answer choices are directly provided to the LLM

ZS-SC: Zero-Shot Self-Consistency—a method that samples multiple reasoning paths and aggregates answers via majority vote

SimCSE: Simple Contrastive Learning of Sentence Embeddings—a framework for learning sentence embeddings used here to measure similarity between generated answers and choices

CSQA: CommonsenseQA—a dataset for commonsense reasoning

SIQA: SocialIQA—a dataset for social and emotional intelligence reasoning

ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions (Easy and Challenge sets)

Open-Ended Question Answering: Prompting the model with just the question (no choices) to generate a free-text response

Cosine Similarity: The cosine of the angle between two non-zero vectors, used here to compare text embeddings; values closer to 1 indicate greater similarity

Beam Search: A decoding strategy that keeps a fixed number of the highest-scoring partial sequences (the beam) at each step instead of committing to a single token

Greedy Search: A decoding strategy that selects the most probable token at each step
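
The contrast between the Greedy Search and Beam Search entries above can be shown on a toy example. The next-token table `TOY_MODEL` is entirely made up: it is built so that the locally best first token leads to a less probable full sequence, which is the case where beam search helps.

```python
import math
from typing import Dict, Tuple

Prefix = Tuple[str, ...]

# Hypothetical next-token probabilities keyed by the prefix generated so far.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    ("a",): {"dog": 0.9, "cat": 0.1},
}


def greedy_search(model: Dict[Prefix, Dict[str, float]], steps: int) -> Prefix:
    """Pick the single most probable token at every step."""
    seq: Prefix = ()
    for _ in range(steps):
        dist = model[seq]
        seq = seq + (max(dist, key=dist.get),)
    return seq


def beam_search(model: Dict[Prefix, Dict[str, float]], steps: int, beam_width: int) -> Prefix:
    """Keep the `beam_width` best partial sequences by total log-probability."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, prob in model[seq].items():
                candidates.append((seq + (token,), logp + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]


print(greedy_search(TOY_MODEL, steps=2))               # ('the', 'dog'), probability 0.24
print(beam_search(TOY_MODEL, steps=2, beam_width=2))   # ('a', 'dog'), probability 0.36
```

With a beam width of 1 the two procedures coincide, since only the single best partial sequence is kept at each step.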