Know-Filter: A small auxiliary model (MonoT5) trained to predict whether a specific piece of generated knowledge will help the solver LLM answer the question correctly.
Utility Score: The probability assigned by the solver LLM to the correct answer option when provided with a specific piece of context knowledge.
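The utility score can be sketched as a softmax over the solver's per-option log-likelihoods, reading off the mass on the correct option. This is a minimal illustration, not the paper's exact implementation; `option_logprobs` stands in for whatever scores the solver LLM assigns to each answer option with the knowledge in context.

```python
import math

def utility_score(option_logprobs, correct_idx):
    """Utility of a knowledge snippet: the solver's probability mass on the
    correct answer option, via a numerically stable softmax over the options'
    log-likelihoods (a sketch; the paper's exact scoring may differ)."""
    m = max(option_logprobs)
    exps = [math.exp(lp - m) for lp in option_logprobs]
    return exps[correct_idx] / sum(exps)

# Toy example: three answer options scored by the solver with a knowledge
# snippet in context; option 0 is the correct one.
print(round(utility_score([-1.0, -2.0, -3.0], 0), 3))
```

A snippet that shifts probability toward the correct option gets a higher utility score; a distracting snippet lowers it.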
UWC Loss: Utility-Weighted Classification Loss—a custom loss function that aligns the Know-Filter's predictions with the actual utility score (probability of correct answer) rather than just binary labels.
SLFG: Sentence-Level Fusion Generation—a decoding strategy in which the LLM generates candidate sentences one step at a time, the Know-Filter scores each candidate, and the best-scoring sentences are fused into the context that prompts the next sentence.
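The SLFG loop can be sketched with the LLM and the Know-Filter abstracted as callables. Here `generate_candidates` and `score` are hypothetical stand-ins for the sentence generator and the filter, and "fusion" is modeled as concatenating the top-scored sentences into the next prompt; the paper's actual fusion and beam settings may differ.

```python
def slfg_decode(generate_candidates, score, n_sentences=3, beam=2):
    """Sentence-Level Fusion Generation (sketch). Each step: the LLM proposes
    candidate next sentences for the current context, the Know-Filter scores
    them, and the top `beam` sentences are fused into the context that
    prompts the next step."""
    context = []
    for _ in range(n_sentences):
        candidates = generate_candidates(" ".join(context))
        best = sorted(candidates, key=score, reverse=True)[:beam]
        context.extend(best)  # fuse the best sentences into the prompt
    return " ".join(context)

# Toy stand-ins: a fixed candidate pool and a filter that prefers "good".
cands = lambda ctx: ["a good sentence.", "a weak sentence."]
score = lambda s: 1.0 if "good" in s else 0.0
print(slfg_decode(cands, score, n_sentences=2, beam=1))
```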
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
MonoT5: A T5 (Text-to-Text Transfer Transformer) model fine-tuned as a point-wise ranker/classifier, often used in information retrieval.
Greedy Decoding: A decoding method that selects the most probable token at each step.
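Greedy decoding reduces to a one-line argmax at each step. In this self-contained sketch the language model is replaced by a toy lookup table mapping a token prefix to next-token probabilities; `next_token_probs` is an assumed stand-in for the real model.

```python
def greedy_decode(next_token_probs, start, max_len=5, eos="<eos>"):
    """Greedy decoding: at every step append the single most probable next
    token, stopping at the end-of-sequence marker or the length cap."""
    seq = list(start)
    for _ in range(max_len):
        probs = next_token_probs(tuple(seq))
        token = max(probs, key=probs.get)  # argmax over the next-token distribution
        if token == eos:
            break
        seq.append(token)
    return seq

# Toy table-based "language model".
lm = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 0.9, "down": 0.1},
}
print(greedy_decode(lambda s: lm[s], ["the"]))
```

Because it commits to the locally best token, greedy decoding can miss globally higher-probability sequences, which is one motivation for candidate-and-filter strategies like SLFG.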
Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations.
Alpaca: An instruction-following language model fine-tuned from LLaMA on instruction-following demonstrations.