Connecting the Knowledge Dots: Retrieval-augmented Knowledge Connection for Commonsense Reasoning

📝 Paper Summary

Modularized RAG pipeline, Commonsense Reasoning

RECONNECT transforms indirectly relevant retrieved documents into a direct, question-specific explanation by extracting knowledge from diverse document subsets and aggregating that knowledge before inference.

Core Problem

Commonsense reasoning requires implicit knowledge rarely stated explicitly in text, causing standard retrieval to return documents that are only indirectly relevant and lack the direct information needed to answer the question.

Why it matters:

LLMs struggle with commonsense reasoning because the necessary knowledge is implicit and not directly represented in the question text
Existing Retrieval-Augmented Language Models (RALMs) often retrieve documents that do not directly contain the answer, leading to a gap between retrieved context and useful reasoning
Finite knowledge bases may not cover the specific direct information required, limiting generalizability on out-of-domain tasks

Concrete Example: Question: 'What happens when a crumpled and flat sheet of paper drop?' Standard retrieval finds facts about 'crumpling paper' or 'dropping items' generally. RECONNECT synthesizes disparate facts (one doc mentions air resistance depends on shape, another says crumpled paper has less resistance) into a direct explanation: 'The crumpled paper has less air resistance, so its greater net force makes it fall faster.'

Key Novelty

Retrieval-augmented knowledge Connection (RECONNECT)

Explanation-guided retrieval: Expands the original query into a detailed explanation to retrieve contextually aligned documents rather than just keyword matching
Relevance-based document sampling: Stochastically selects document subsets that balance relevance to the question and diversity among documents to capture multiple perspectives
Knowledge connection: Extracts relevant knowledge from each of these diverse subsets and aggregates the extracted statements into a single, coherent, direct explanation used for final inference

Architecture

The RECONNECT pipeline: Query Expansion → Retrieval → Relevance-based Subset Sampling → Knowledge Extraction from subsets → Aggregation into a direct explanation → Answer Prediction.

Evaluation Highlights

+2.0 points average accuracy improvement over the SOTA baseline (ZEBRA) on 8 in-domain commonsense benchmarks using Llama 3.1-8B Instruct
+4.6 points average accuracy improvement over the SOTA baseline (ZEBRA) on 8 out-of-domain benchmarks, demonstrating strong generalization
Outperforms supervised knowledge generation methods (such as COCONUT) without fine-tuning a dedicated generation model

Breakthrough Assessment

8/10

Significant gains on both ID and OOD tasks by addressing the specific semantic gap in commonsense retrieval. The shift from simple retrieval to 'retrieval-then-synthesis' is a strong methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot commonsense question answering using retrieval augmentation

Inputs: Question Q and choices C

Outputs: Predicted answer A (choice index)

Pipeline Flow

Explanation-guided Retrieval (Query Expansion → Retrieval)
Knowledge Connection (Subset Sampling → Extraction → Aggregation)
Knowledge-Informed Reasoning (Final Answer Prediction)

System Modules

Explanation-based Query Generator (Explanation-guided Retrieval)

Generate comprehensive explanations supporting potential choices to use as search queries

Model or implementation: Llama 3.1-8B Instruct

Explanation-based Retriever (Explanation-guided Retrieval)

Retrieve documents using both the original question and the generated explanations as queries

Model or implementation: E5-base-v2 (fine-tuned on ZEBRA-KB)

Document Sampler (Knowledge Connection)

Select diverse subsets of documents to minimize knowledge conflict and maximize the range of perspectives covered

Model or implementation: Algorithm (Stochastic Sampling)

Knowledge Extractor (Knowledge Connection)

Generate relevant knowledge from each document subset

Model or implementation: Llama 3.1-8B Instruct

Knowledge Aggregator (Knowledge Connection)

Synthesize the extracted knowledge into a final direct explanation

Model or implementation: Llama 3.1-8B Instruct

Reasoning Module

Predict the final answer using the question and the synthesized explanation

Model or implementation: Llama 3.1-8B Instruct
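
The six modules above can be wired together roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `reconnect_answer`, the prompt wording, the answer parsing, and the callable interfaces (`llm`, `retrieve`, `sample_subsets`) are all hypothetical. The defaults `k=5` and `n_knowledge=3` mirror the hyperparameters listed in this summary, under the assumption that K applies per query.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces; any LLM client or retriever can be adapted to them.
LLM = Callable[[str], str]                                  # prompt -> completion
Retriever = Callable[[str, int], List[str]]                 # (query, k) -> documents
Sampler = Callable[[str, List[str], int], List[List[str]]]  # (question, docs, n) -> subsets


def reconnect_answer(question: str, choices: Sequence[str], llm: LLM,
                     retrieve: Retriever, sample_subsets: Sampler,
                     k: int = 5, n_knowledge: int = 3) -> int:
    """Return the index of the predicted choice."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(choices))

    # 1. Explanation-guided retrieval: one explanation per choice, then query
    #    with the original question and every explanation.
    explanations = [
        llm(f"Explain why '{c}' could answer the question: {question}")
        for c in choices
    ]
    docs: List[str] = []
    for query in [question, *explanations]:
        for doc in retrieve(query, k):
            if doc not in docs:          # de-duplicate, keep first-seen order
                docs.append(doc)

    # 2. Knowledge connection: sample subsets, extract knowledge per subset,
    #    then aggregate into one direct explanation.
    subsets = sample_subsets(question, docs, n_knowledge)
    knowledge = [
        llm("Extract knowledge relevant to the question.\n"
            f"Question: {question}\nDocuments:\n" + "\n".join(subset))
        for subset in subsets
    ]
    explanation = llm("Combine the statements into one direct explanation.\n"
                      f"Question: {question}\nStatements:\n" + "\n".join(knowledge))

    # 3. Knowledge-informed reasoning: answer using the synthesized explanation.
    reply = llm("Answer with the option number only.\n"
                f"Question: {question}\nOptions:\n{options}\n"
                f"Explanation: {explanation}")
    digits = [ch for ch in reply if ch.isdecimal()]
    index = int(digits[0]) if digits else 0   # fall back to the first choice
    return index if index < len(choices) else 0
```

Each stage is passed in as a callable so that the sampler or retriever can be swapped without touching the orchestration, which matches the modular framing of the pipeline.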

Novel Architectural Elements

Two-stage transformation pipeline: Instead of direct RAG, documents are first sampled into subsets, processed into intermediate knowledge, and then aggregated
Stochastic Relevance-based Sampling: A specific sampling algorithm balancing query relevance and intra-subset similarity to handle diverse perspectives
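
The summary does not spell out the sampling rule, so the following is only one plausible reading of "balancing query relevance and intra-subset similarity". It scores each remaining document as its relevance to the query minus its maximum similarity to documents already in the subset, then samples from a temperature-scaled softmax. The scoring rule, the function names, and the use of cosine similarity are assumptions for illustration; only the temperature `tau` and the idea of drawing several subsets come from the text.

```python
import math
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sample_subset(query_vec: Vector, doc_vecs: Sequence[Vector], size: int,
                  tau: float = 1.0, rng: Optional[random.Random] = None) -> List[int]:
    """Draw one subset of document indices without replacement."""
    rng = rng or random.Random()
    chosen: List[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(chosen) < size:
        scores = []
        for i in remaining:
            relevance = cosine(query_vec, doc_vecs[i])
            # Assumed penalty: similarity to the closest already-chosen document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in chosen),
                             default=0.0)
            scores.append((relevance - redundancy) / tau)
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # unnormalized softmax
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def sample_subsets(query_vec: Vector, doc_vecs: Sequence[Vector],
                   n_subsets: int = 3, size: int = 2, tau: float = 1.0,
                   seed: int = 0) -> List[List[int]]:
    """Draw several subsets; a higher tau flattens the distribution."""
    rng = random.Random(seed)
    return [sample_subset(query_vec, doc_vecs, size, tau, rng)
            for _ in range(n_subsets)]
```

Because sampling is stochastic, different subsets can surface different documents, which is what lets the later extraction step see multiple perspectives on the same question.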

Modeling

Base Model: Llama 3.1-8B Instruct

Training Method: Fine-tuning (Retriever only)

Objective Functions:

Purpose: Train retriever to find documents supporting the reasoning process.

Formally: NCE loss $\mathcal{L}_{\mathrm{Ret}} = -\log \frac{\exp(\mathrm{sim}(q, d^{+}))}{\exp(\mathrm{sim}(q, d^{+})) + \sum_{d^{-}} \exp(\mathrm{sim}(q, d^{-}))}$
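
To make the loss concrete, here is a small numeric sketch for a single query. It takes the similarity to the positive document and the similarities to the negatives as plain floats; how `sim` is computed (for example, a dot product of E5 embeddings) is left out, and the log-sum-exp rearrangement is a numerical-stability choice made here, not something stated in the source.

```python
import math
from typing import Sequence


def nce_loss(pos_sim: float, neg_sims: Sequence[float]) -> float:
    """-log( e^pos / (e^pos + sum_j e^neg_j) ) for one query."""
    logits = [pos_sim, *neg_sims]
    top = max(logits)
    # log of the denominator, computed with the max subtracted for stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in logits))
    return log_denominator - pos_sim


# The loss shrinks as the positive document outscores the negatives.
print(nce_loss(0.2, [0.1, 0.3]))
print(nce_loss(5.0, [0.1, 0.3]))
```

With one negative at the same similarity as the positive, the loss equals log 2, the value for a coin-flip between the two documents.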

Training Data:

ZEBRA-KB training set (questions and explanations)

Key Hyperparameters:

learning_rate: 1e-5
weight_decay: 1e-2
batch_size: 200 (negative samples)
max_sequence_length: 256
training_steps: 25000
retrieved_documents_K: 5
knowledge_N: 3
temperature_tau: 1.0, 1.5, 2.0 (varied)
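
For reference, the listed values can be collected in a small configuration object. The class names, field names, and the split into retriever-training versus inference settings are assumptions made for readability; only the values come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetrieverTrainingConfig:
    learning_rate: float = 1e-5
    weight_decay: float = 1e-2
    batch_size: int = 200            # reported as the number of negative samples
    max_sequence_length: int = 256
    training_steps: int = 25_000


@dataclass(frozen=True)
class InferenceConfig:
    retrieved_documents_k: int = 5
    knowledge_n: int = 3
    temperature_tau_grid: Tuple[float, ...] = (1.0, 1.5, 2.0)  # values varied


train_cfg = RetrieverTrainingConfig()
infer_cfg = InferenceConfig()
```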

Compute: Four RTX A6000 GPUs

Comparison to Prior Work

vs. ZEBRA: RECONNECT synthesizes a direct explanation from retrieved docs rather than just using retrieved examples; RECONNECT generalizes better to OOD data (+4.6 points on average)
vs. RACo: RECONNECT uses explanation-guided retrieval and multi-subset aggregation instead of single-pass retrieval
vs. COCONUT: Unlike COCONUT, RECONNECT does not require supervised fine-tuning of a dedicated knowledge-generation model
vs. FLARE/DRAGIN: RECONNECT focuses on transforming indirect commonsense knowledge rather than just deciding when to retrieve

Limitations

Inference costs are higher due to multiple LLM calls for knowledge generation and aggregation
Exploration is limited to multiple-choice QA; applicability to open-ended QA or logical/arithmetic reasoning is left to future work
Performance saturation observed when increasing the number of generated knowledge statements beyond a certain point

Reproducibility

Code: https://github.com/JunhoKim94/ReConnect

Code and data are publicly available. The retrieval corpus is constructed from RACo, COCONUT, and ZEBRA-KB. Hyperparameters are detailed in the Appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot commonsense QA on 8 ID datasets and 8 OOD datasets

Benchmarks:

CommonsenseQA (CSQA) (Commonsense QA)
OpenBookQA (OBQA) (Science/Commonsense QA)
ARC-Easy / ARC-Challenge (Science Exam QA)
PIQA (Physical Commonsense)
HellaSwag (Sentence Completion, OOD)
Story Cloze Test (SCT) (Story Completion, OOD)
NumerSense (Numerical Commonsense, OOD)

Metrics:

Accuracy
Statistical methodology: Reported average results across three random seeds

Key Results

RECONNECT outperforms all baselines on In-Domain (ID) datasets, showing the benefit of the knowledge connection approach even when training data is available in the corpus.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 ID datasets)	Accuracy	78.4	80.4	+2.0
ARC-Challenge	Accuracy	83.8	85.0	+1.2

RECONNECT shows even stronger gains on Out-of-Domain (OOD) datasets, highlighting superior generalization compared to baselines that rely heavily on corpus coverage.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 OOD datasets)	Accuracy	75.5	80.1	+4.6
NumerSense	Accuracy	59.5	79.5	+20.0
Average (ID+OOD)	Accuracy	76.6	80.4	+3.8

Experiment Figures

Performance (Test Accuracy) vs. Number of Knowledge Statements (1 to 9) on CSQA, QuaRTz, ARC Challenge, and OBQA.

Human evaluation of generated knowledge quality (Relevance, Factuality, Helpfulness) for Single Retrieval, ZEBRA, and RECONNECT on ID (CSQA) and OOD (NumerSense) data.

Main Takeaways

Consistent gains across ID (+2.0 points) and OOD (+4.6 points) benchmarks indicate that connecting fragmented knowledge outperforms simple retrieval or generation alone
The method is particularly effective on tasks where direct knowledge is missing from the corpus (e.g., NumerSense, +20.0 points), which supports the value of synthesizing indirect information
Ablation studies show that both 'Explanation-guided Retrieval' and 'Knowledge Connection' (KC) are essential; removing KC drops performance significantly
Human evaluation indicates RECONNECT generates knowledge with higher 'Relevance' and 'Helpfulness' than baselines, while maintaining high 'Factuality'

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Dense Passage Retrieval (DPR)
Commonsense Reasoning tasks

Key Terms

RALM: Retrieval-Augmented Language Model—an LLM enhanced with external knowledge retrieved from a corpus

RECONNECT: Retrieval-augmented knowledge Connection—the proposed framework that transforms indirect documents into direct explanations

ID: In-Domain—benchmarks where the training sets are included in the retrieval corpus

OOD: Out-of-Domain—benchmarks where the training sets are NOT included in the retrieval corpus

DPR: Dense Passage Retrieval—a method using dual encoders to retrieve relevant documents based on embedding similarity

NCE: Noise Contrastive Estimation—a loss function used to train the retriever to distinguish between positive and negative document pairs

MMR: Maximal Marginal Relevance—a method to select documents that are relevant to the query but also diverse from each other

RACo: Retrieval-Augmented Commonsense—a large-scale commonsense corpus used for retrieval

ZEBRA: Zero-shot Example-Based Retrieval Augmentation—a baseline method that retrieves relevant QA examples

COCONUT: Contextualized Commonsense Unified Transformers—a baseline method and corpus for graph-based commonsense augmentation
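
As a concrete reference for the MMR entry above, the standard greedy selection rule can be sketched as follows: at each step, pick the candidate maximizing `lam * relevance - (1 - lam) * max similarity to already-selected documents`. The function name `mmr_select` and the precomputed-similarity inputs are choices made for this illustration, and the snippet shows generic MMR rather than RECONNECT's own stochastic sampler.

```python
from typing import List, Sequence


def mmr_select(query_sims: Sequence[float],
               doc_sims: Sequence[Sequence[float]],
               k: int, lam: float = 0.5) -> List[int]:
    """Greedy MMR over precomputed similarities.

    query_sims[i]  : similarity between the query and document i
    doc_sims[i][j] : similarity between documents i and j
    lam            : 1.0 means pure relevance, 0.0 means pure diversity
    """
    selected: List[int] = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but distinct.
query_sims = [0.9, 0.85, 0.3]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
print(mmr_select(query_sims, doc_sims, k=2, lam=0.5))  # [0, 2]
print(mmr_select(query_sims, doc_sims, k=2, lam=1.0))  # [0, 1]
```

With `lam=0.5` the near-duplicate document 1 is skipped in favor of the distinct document 2; with `lam=1.0` the selection reduces to plain top-k by relevance.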