Enhancing LLM Intelligence with ARM-RAG: Auxiliary Rationale Memory for Retrieval Augmented Generation

📝 Paper Summary

Memory recall, Modularized RAG pipeline

ARM-RAG improves LLM problem-solving by storing successful reasoning chains (rationales) from past attempts in a dense retrieval index and using them as few-shot examples for similar future problems.

Core Problem

LLMs often fail to solve complex reasoning problems (like math) and do not learn from their successes or failures without expensive retraining or fine-tuning.

Why it matters:

Frozen LLMs are static and cannot acquire new problem-solving strategies over time
Fine-tuning approaches (like STaR) require substantial data and compute resources
Standard RAG typically retrieves factual documents, not problem-solving strategies or reasoning patterns needed for logic tasks

Concrete Example: When asked a math problem about house flipping profits, GPT-3.5 might miscalculate the initial value by adding repair costs incorrectly. However, if prompted with a 'rationale' (a step-by-step solution) from a similar correctly solved problem, it avoids this structural error.

Key Novelty

Auxiliary Rationale Memory (ARM)

Store the 'thought process' (step-by-step reasoning) of successfully solved problems in a vector database, rather than just factual documents
At inference time, retrieve these successful reasoning chains based on problem similarity to use as in-context learning demonstrations
Use 'obfuscation' (replacing nouns/names with nonsense words) during retrieval to force the retriever to match on problem structure/logic rather than surface-level keywords
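The store-then-retrieve idea above can be sketched as a small in-memory structure. This is a minimal illustration, not the paper's implementation: `Rationale`, `RationaleMemory`, and `token_overlap` are hypothetical names, and the toy token-overlap score stands in for the dense embeddings the actual system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rationale:
    problem: str    # original question text
    reasoning: str  # step-by-step solution produced by the LLM
    answer: str     # final answer extracted from the reasoning


@dataclass
class RationaleMemory:
    """Hypothetical in-memory stand-in for the dense retrieval index."""
    similarity: Callable[[str, str], float]
    entries: List[Rationale] = field(default_factory=list)

    def add_if_correct(self, candidate: Rationale, gold_answer: str) -> bool:
        # Only reasoning chains that reached the right answer are stored.
        if candidate.answer.strip() == gold_answer.strip():
            self.entries.append(candidate)
            return True
        return False

    def retrieve(self, query: str, k: int = 2) -> List[Rationale]:
        # Rank stored problems by similarity to the query, highest first.
        ranked = sorted(
            self.entries,
            key=lambda r: self.similarity(query, r.problem),
            reverse=True,
        )
        return ranked[:k]


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity; an assumption replacing dense embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

A memory built with `RationaleMemory(similarity=token_overlap)` accepts a candidate only when its answer matches the reference, and `retrieve` returns the stored entries to use as demonstrations.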

Evaluation Highlights

+4.2 percentage points of accuracy (77.4% vs. 73.2%) on GSM8K for obfuscated ARM-RAG over the GPT-3.5 baseline
Multi-attempt questioning (voting/best-of-N) alone achieves 91.9% accuracy, providing a rich source of correct rationales for the memory
Strong prompting (providing the correct answer as a hint) yields 80% accuracy, validating that optimal context significantly aids performance
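The summary describes multi-attempt questioning as "voting/best-of-N" without stating the number of attempts or which reading produces the 91.9% figure. The sketch below shows both readings side by side; the function names are hypothetical, and answers are compared as plain strings for simplicity.

```python
from collections import Counter
from typing import List, Optional


def any_correct(attempts: List[str], gold: str) -> bool:
    """Best-of-N reading: success if at least one attempt matches the reference."""
    return any(a == gold for a in attempts)


def majority_vote(attempts: List[str]) -> Optional[str]:
    """Voting reading: return the most frequent answer (first seen wins ties)."""
    if not attempts:
        return None
    return Counter(attempts).most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up answers from five attempts at one problem.
    attempts = ["18", "20", "18", "16", "18"]
    print(any_correct(attempts, "20"))  # one attempt reached the reference
    print(majority_vote(attempts))      # the answer most attempts agree on
```

Under the best-of-N reading, every attempt that matches the reference is a candidate rationale for the memory, which is why a high multi-attempt success rate translates into a large pool of stored examples.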

Breakthrough Assessment

4/10

Proposes a logical extension to RAG (retrieving rationales), but the empirical gains are modest (+4.2 percentage points on GSM8K) and the system relies on a basic pipeline without novel training or architecture.

⚙️ Technical Details

Problem Definition

Setting: Grade-school math word problem solving using retrieval-augmented generation

Inputs: Math word problem P

Outputs: Numerical answer A with reasoning steps

Pipeline Flow

Input Question → Obfuscation (Optional) → Retrieval (Pyserini) → Prompt Construction → Generation (GPT-3.5)
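The flow above can be read as a composition of four stages. The following sketch wires them together with injected callables so that each stage can be swapped independently; `arm_rag_answer` and its parameter names are hypothetical, and the choice to obfuscate only the retrieval query (not the text shown to the generator) is an assumption based on the summary's description of obfuscation as a retrieval-time step.

```python
from typing import Callable, List, Optional


def arm_rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    build_prompt: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
    obfuscate: Optional[Callable[[str], str]] = None,
) -> str:
    """Run one question through the pipeline, with obfuscation optional."""
    # Assumption: obfuscation changes only the retrieval query, so the
    # generator still sees the original wording of the question.
    query = obfuscate(question) if obfuscate is not None else question
    rationales = retrieve(query)
    prompt = build_prompt(question, rationales)
    return generate(prompt)
```

Passing `obfuscate=None` gives the plain ARM-RAG variant; supplying a function gives the obfuscated variant without touching the other stages.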

System Modules

Obfuscator

Replace nouns and proper names with nonsense words to emphasize problem structure over semantics during retrieval

Model or implementation: gpt-3.5-turbo (used to identify nouns/names)
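A rough sketch of the replacement step follows. In the paper the nouns and names are identified by the LLM; here they are simply passed in as a list, and the syllable table, `nonsense_word`, and `obfuscate` are all hypothetical. Numbers are left untouched so the arithmetic structure of the problem survives.

```python
import re
from typing import Dict, Iterable

# Made-up syllables; any table works as long as the outputs are not real words.
_SYLLABLES = ["plu", "mbu", "zzle", "gor", "vek", "tam", "nix", "lo"]


def nonsense_word(index: int) -> str:
    """Build a deterministic made-up word (unique for the first 64 indices)."""
    first, second = divmod(index, len(_SYLLABLES))
    return _SYLLABLES[second] + _SYLLABLES[first % len(_SYLLABLES)] + "ix"


def obfuscate(text: str, terms: Iterable[str]) -> str:
    """Replace each listed noun or name with the same nonsense word everywhere."""
    mapping: Dict[str, str] = {}
    for i, term in enumerate(terms):
        mapping[term.lower()] = nonsense_word(i)
    if not mapping:
        return text
    # Longest terms first so multi-word names win over their parts.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.sub(
        rf"\b(?:{pattern})\b",
        lambda m: mapping[m.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )


if __name__ == "__main__":
    # Made-up sentence in the style of a house-flipping problem.
    print(obfuscate("Josh buys a house for $80,000.", ["Josh", "house"]))
```

Mapping each term to one fixed replacement keeps references consistent within a problem, so a retriever comparing two obfuscated problems has little to match on besides their quantities and sentence structure.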

Retriever

Find top-k similar solved problems (rationales) from the memory

Model or implementation: Pyserini (Dense retrieval with FAISS)
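To make the retrieval step concrete, the sketch below performs an exhaustive inner-product search over a handful of vectors in pure Python. It does not use Pyserini or FAISS; `top_k_mips` is a hypothetical helper that shows the ranking those libraries compute far more efficiently, and the vectors are made-up stand-ins for embeddings.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_mips(
    query: Sequence[float],
    index: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Exhaustive inner-product search: (row id, score) pairs, best first."""
    scored = ((i, inner_product(query, vec)) for i, vec in enumerate(index))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Three made-up 2-d "embeddings" of stored problems.
    index = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    for row_id, score in top_k_mips([1.0, 0.2], index, k=2):
        print(row_id, round(score, 2))
```

The returned row ids would then be used to look up the stored problems and their rationales.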

Generator

Solve the math problem using the retrieved rationales as few-shot examples

Model or implementation: gpt-3.5-turbo
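A minimal sketch of how retrieved rationales might be laid out as few-shot examples is shown below. The `Q:`/`A:` template and the `build_prompt` name are assumptions; the summary does not give the exact prompt wording used with gpt-3.5-turbo.

```python
from typing import List, Tuple


def build_prompt(question: str, examples: List[Tuple[str, str]]) -> str:
    """Place retrieved (problem, rationale) pairs before the new question."""
    parts = [f"Q: {problem}\nA: {rationale}" for problem, rationale in examples]
    # The new question comes last, with the answer slot left open for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ("Ann has 3 pens and buys 2 more. How many pens?", "3 + 2 = 5. The answer is 5."),
        ("A box holds 4 rows of 6 eggs. How many eggs?", "4 * 6 = 24. The answer is 24."),
    ]
    print(build_prompt("Ben has 7 coins and spends 3. How many are left?", demos))
```

With an empty example list the function reduces to a zero-shot prompt, which corresponds to the baseline condition.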

Novel Architectural Elements

Utilization of an 'Obfuscator' module specifically to decouple structural similarity from semantic/keyword similarity in dense retrieval

Modeling

Base Model: gpt-3.5-turbo

Compute: Experiments run on Google Colab with NVIDIA A100 GPUs

Comparison to Prior Work

vs. STaR: ARM-RAG uses retrieval (non-parametric memory) instead of fine-tuning (parametric memory) to utilize successful reasoning chains
vs. Standard RAG: Retrieves 'rationales' (reasoning paths) rather than factual documents
vs. Auto-CoT [not cited in paper]: Auto-CoT clusters questions to select diverse demonstrations; ARM-RAG retrieves specific similar demonstrations based on the query

Limitations

Obfuscation technique is rudimentary (GPT-based replacement) and only partially successful at isolating structure
Performance gains are relatively small (+2.1 percentage points over non-obfuscated ARM-RAG, +4.2 over the baseline)
Reliance on a fixed memory of solved training problems; does not generate new knowledge
Evaluation limited to a single dataset (GSM8K) and single model family (GPT-3.5)

Reproducibility

Code: https://github.com/ericmelz/arm-rag

Data is a subset of GSM8K (7473 examples). The retriever uses Pyserini/FAISS. The LLM is accessed via API (GPT-3.5-turbo), so exact reproduction depends on which API model version is being served.

📊 Experiments & Results

Evaluation Setup

Math word problem solving on the GSM8K dataset

Benchmarks:

GSM8K (Grade-school math word problems)

Metrics:

Accuracy (percentage of correct answers)
Statistical methodology: Not explicitly reported in the paper
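The accuracy metric can be sketched as follows. The summary only says accuracy is the percentage of correct answers; taking the last number in the generated text as the model's final answer is an assumption made here for illustration, and `final_number` and `accuracy` are hypothetical helpers.

```python
import re
from typing import List, Optional

# Matches integers and decimals, allowing thousands separators such as 70,000.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[float]:
    """Take the last number in a generated solution as its final answer."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(predictions: List[str], gold: List[float], tol: float = 1e-6) -> float:
    """Percentage of problems whose extracted answer matches the reference."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = 0
    for text, target in zip(predictions, gold):
        value = final_number(text)
        if value is not None and abs(value - target) <= tol:
            correct += 1
    return 100.0 * correct / len(gold)
```

A generation with no number in it counts as incorrect rather than raising an error, so malformed outputs lower the score instead of being silently dropped.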

Key Results

Main experiments on GSM8K comparing baseline GPT-3.5 against ARM-RAG variants.

Benchmark	Method	Metric	Baseline	This Paper	Δ
GSM8K	ARM-RAG (no obfuscation)	Accuracy (%)	73.2	75.3	+2.1
GSM8K	Obfuscated ARM-RAG	Accuracy (%)	73.2	77.4	+4.2

Main Takeaways

Retrieving relevant reasoning chains improves performance over standard prompting, but naive retrieval often fetches superficially similar (same topic) rather than structurally similar (same math logic) problems
Obfuscating the query (masking nouns/names) forces the dense retriever to focus slightly more on structure, yielding better demonstrations and higher accuracy
The 'upper bound' capability of the model is high (91.9% with multi-attempt voting), suggesting that the main bottleneck is selecting the right context/strategy
Strong negative prompting (providing incorrect answers as context) has little detrimental effect, whereas strong positive prompting (providing the answer) helps significantly

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Chain-of-Thought (CoT) prompting
Basic knowledge of dense retrieval (vector databases)

Key Terms

ARM-RAG: Auxiliary Rationale Memory for RAG, the proposed system that retrieves reasoning chains from previously solved problems to help solve new ones

Rationales: Step-by-step reasoning chains or 'scratchpads' explaining how a solution is derived

Obfuscation: A technique used here to mask nouns and names in a query with nonsense words (e.g., 'plumbuzzle') to force the retriever to focus on problem structure rather than topic

MIPS: Maximum Inner Product Search, the task of finding the indexed vectors whose inner product with the query vector is largest; used here to pick the most similar stored problems

STaR: Self-Taught Reasoner, a prior method that fine-tunes a model on its own correct self-generated rationales; this paper cites it as a compute-heavy alternative

Pyserini: A Python toolkit for reproducible information retrieval used here for the dense retrieval implementation