RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering

📝 Paper Summary

Graph-based RAG pipeline

RPO-RAG improves small LLM reasoning on knowledge graphs by training the models with relation-level preference signals derived from semantically sampled paths, rather than with final-answer supervision alone.

Core Problem

Existing KG-based RAG methods rely on semantics-unaware shortest-path heuristics that introduce irrelevant noise, and they supervise models only on final answers, failing to teach small LLMs the intermediate reasoning steps required for complex queries.

Why it matters:

Small LLMs (sub-7B) lack the capacity to filter irrelevant retrieval noise or organize fragmented evidence, leading to hallucinations
Prior methods prioritize topological proximity over semantic relevance, causing models to learn incorrect reasoning patterns
Current flat-list prompts do not guide models to integrate evidence from multiple paths into a coherent answer

Concrete Example: For the query 'Who is the character in Juice and music producer of The Don Killuminati?', standard methods retrieve paths like 'Juice -> character -> Q' because 'Q' is topologically close, ignoring the 'music producer' constraint. RPO-RAG identifies the path connecting both constraints ('Juice' and 'The Don Killuminati') to the correct answer 'Bishop'.

Key Novelty

Relation-aware Weighted Preference Optimization for RAG

Replaces heuristic path sampling (e.g., BFS) with a query-path semantic sampling strategy that clusters paths by embedding similarity to find those matching query intent
Introduces a relation-level preference optimization objective that trains the LLM to prefer semantically relevant relations at each step of the reasoning path, rather than just the final answer
Restructures prompts into an 'answer-centered' format that groups all reasoning paths supporting a specific candidate answer together, helping small LLMs aggregate evidence

Architecture

The overall architecture of RPO-RAG, illustrating the pipeline from query to answer.

Evaluation Highlights

Achieves state-of-the-art results among sub-8B models on the WebQSP (89.9 Hit) and CWQ datasets, surpassing the previous best (GCR) by +2.7 points Hit and +10.2 points F1 on WebQSP (Llama3.1-8B)
RPO-RAG with Llama3.2-3B improves Hit by +24.8 points on WebQSP and +46.1 points on CWQ compared to the vanilla base model, showing that the gains carry over to small models
Narrows the gap with large proprietary models: RPO-RAG (Llama3.2-1B) surpasses ToG (ChatGPT) by +6.1 points Hit on WebQSP

Breakthrough Assessment

8/10

Strong empirical gains for small models, effectively enabling them to perform complex graph reasoning previously reserved for larger models. The relation-level optimization is a novel and logical extension of preference learning to KGQA.

⚙️ Technical Details

Problem Definition

Setting: Knowledge Graph Question Answering (KGQA) where the system retrieves reasoning paths from a KG to answer a natural language question

Inputs: Natural language question q and a Knowledge Graph G

Outputs: Answer entity e_a

Pipeline Flow

Group: Retrieval
Semantic-Matching Retriever (dynamic beam search to extract paths)
Group: Reasoning
Small LLM Reasoner (generates answer using answer-centered prompts)

System Modules

Semantic-Matching Retriever

Retrieve reasoning paths semantically aligned with the query

Model or implementation: Sentence-BERT (fine-tuned)
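
The retriever's role can be illustrated with a small sketch. It runs a fixed-width beam search over a toy graph and scores each partial path by the cosine similarity between bag-of-words vectors of the query and the path's relations. The toy graph, the relation names, the `embed` function (a stand-in for the fine-tuned Sentence-BERT encoder), and the fixed beam width are all assumptions made for illustration; the rule that makes the paper's beam search "dynamic" is not described in this summary and is not reproduced here.

```python
import math
from collections import Counter

# Toy KG (hypothetical): head entity -> list of (relation, tail entity).
KG = {
    "Juice": [
        ("film.character", "Bishop"),
        ("film.character", "Q"),
        ("film.release_year", "1992"),
    ],
    "Bishop": [("character.played_by", "Tupac Shakur")],
    "Q": [("character.played_by", "Omar Epps")],
}


def embed(text):
    """Bag-of-words stand-in for a sentence encoder (assumption)."""
    tokens = text.lower().replace(".", " ").replace("_", " ").split()
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def beam_search(query, start, max_hops=2, beam_width=2):
    """Return (score, entities, relations) tuples, best first."""
    q_vec = embed(query)
    beams = [(0.0, [start], [])]
    kept = []
    for _ in range(max_hops):
        candidates = []
        for _, entities, relations in beams:
            for rel, tail in KG.get(entities[-1], []):
                new_rels = relations + [rel]
                # Score the whole relation sequence against the query.
                score = cosine(q_vec, embed(" ".join(new_rels)))
                candidates.append((score, entities + [tail], new_rels))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
        kept.extend(beams)
    return sorted(kept, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    for score, entities, relations in beam_search(
        "who played the character in the film Juice", "Juice"
    ):
        print(round(score, 3), entities, relations)
```

With a real encoder, `embed` and `cosine` would be replaced by dense embeddings and their similarity; the pruning loop would stay the same.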

Small LLM Reasoner

Generate the final answer based on retrieved paths

Model or implementation: Llama-2-7B, Llama-3.1-8B, Llama-3.2-3B, or Llama-3.2-1B (fine-tuned)
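
To make the answer-centered prompt format concrete, the sketch below groups retrieved paths by the entity they end at, so that all evidence for one candidate answer appears together. The function name, the path representation, and the prompt wording are hypothetical; the paper's actual template is not given in this summary.

```python
from collections import defaultdict


def build_answer_centered_prompt(question, paths):
    """Group paths by their final entity and render a plain-text prompt.

    `paths` is a list of alternating entity/relation sequences such as
    ["Juice", "film.character", "Bishop"]; the last element of each
    sequence is treated as the candidate answer.
    """
    grouped = defaultdict(list)
    for path in paths:
        grouped[path[-1]].append(" -> ".join(path))

    lines = [f"Question: {question}", "Candidate answers and supporting paths:"]
    for candidate, supporting in grouped.items():
        lines.append(f"[{candidate}]")
        lines.extend(f"  {p}" for p in supporting)
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_answer_centered_prompt(
            "example question",
            [["A", "r1", "X"], ["B", "r2", "X"], ["A", "r3", "Y"]],
        )
    )
```

A flat-list prompt would instead emit the three paths in retrieval order, leaving the model to work out that two of them support the same candidate.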

Novel Architectural Elements

Integration of relation-level preference optimization into the training loop of the reasoner
Dynamic clustering module in data construction to automatically determine positive/negative paths for supervision

Modeling

Base Model: Llama-2-7B, Llama-3.1-8B, Llama-3.2-3B, Llama-3.2-1B

Training Method: Dual-Objective Optimization: Relation-aware Weighted Preference Optimization + Answer-Centered Prompt Optimization

Objective Functions:

Purpose: Maximize likelihood of correct answers given the prompt.

Formally: Standard Cross-Entropy Loss on the answer tokens.

Purpose: Optimize preference for semantically relevant relations in the reasoning path.

Formally: Margin-based preference loss L_rel = -log(sigmoid(beta * (w+ log P(y+|x) - w- log P(y-|x)) - gamma)), where w+ and w- are confidence weights based on semantic distance.
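
The relation-level loss above can be evaluated for a single preferred/dispreferred pair as in the sketch below. It is a direct transcription of the stated formula in plain Python, not the authors' training code. The numeric values for `beta`, `gamma`, and the weights are placeholders, since the paper does not report them.

```python
import math


def relation_preference_loss(logp_pos, logp_neg, w_pos, w_neg, beta, gamma):
    """L_rel = -log(sigmoid(beta * (w+ * logP(y+|x) - w- * logP(y-|x)) - gamma))."""
    margin = beta * (w_pos * logp_pos - w_neg * logp_neg) - gamma
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Placeholder values (assumptions), chosen only to show the trend.
    weak = relation_preference_loss(-2.0, -1.5, 1.0, 1.0, beta=1.0, gamma=0.5)
    strong = relation_preference_loss(-0.5, -3.0, 1.0, 1.0, beta=1.0, gamma=0.5)
    print(weak, strong)  # the loss shrinks as the preferred relation gains probability
```

In a training loop the log-probabilities would come from the LLM, and the same expression would typically be written with a batched, differentiable log-sigmoid (for example `torch.nn.functional.logsigmoid`).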

Adaptation: LoRA

Training Data:

WebQuestionsSP (WebQSP) and Complex WebQuestions (CWQ)
Training data created via Query-Path Semantic Sampling: gradient-based dynamic clustering of paths based on PLM embedding similarity to the query
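
The summary does not spell out how the gradient-based dynamic clustering works. One plausible reading, shown below purely as an assumption, is to rank paths by their similarity to the query and cut the ranking at the steepest drop, treating paths above the cut as positives and the rest as negatives. The function name and the single-cut rule are hypothetical and may differ from the paper's procedure.

```python
def split_paths_by_similarity_gap(scored_paths):
    """Split (path, similarity) pairs into positives and negatives.

    Assumption: the cut is placed at the largest drop between neighbouring
    similarities in the descending ranking (a discrete "gradient").
    """
    ranked = sorted(scored_paths, key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return ranked, []
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1
    return ranked[:cut], ranked[cut:]


if __name__ == "__main__":
    positives, negatives = split_paths_by_similarity_gap(
        [("path_a", 0.90), ("path_c", 0.30), ("path_b", 0.85), ("path_d", 0.25)]
    )
    print(positives, negatives)
```

Because the cut point depends on the score distribution rather than a fixed threshold, the number of positive paths can vary from query to query, which is the property the word "dynamic" suggests.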

Key Hyperparameters:

margin_gamma: Not explicitly reported in the paper
scaling_factor_beta: Not explicitly reported in the paper
decay_rate_alpha: Not explicitly reported in the paper

Compute: 2x NVIDIA RTX 4090 GPUs for fine-tuning. Inference on a single NVIDIA RTX 3090.

Comparison to Prior Work

vs. RoG/GCR: RPO-RAG uses a lightweight PLM retriever instead of LLM generation, improving efficiency, and optimizes intermediate reasoning steps via preference learning
vs. SubgraphRAG/GNN-RAG: RPO-RAG incorporates semantic path sampling and relation-aware preference optimization, whereas others rely on heuristic sampling or standard supervision
vs. ToG (used in the paper as a baseline rather than as a direct architectural comparison): ToG (Think-on-Graph) uses an LLM to explore the graph iteratively; RPO-RAG separates retrieval from reasoning and optimizes the reasoner to make better use of the retrieved paths

Limitations

Relies on the availability and quality of a structured Knowledge Graph (Freebase)
Performance depends on the PLM's ability to embed query and path semantics accurately during sampling
Does not explicitly report hyperparameters for the preference optimization loss (alpha, beta, gamma)

Reproducibility

Code: https://github.com/KaeHyun/RPO-RAG

Source code available at https://github.com/KaeHyun/RPO-RAG. Trained models available at Zenodo. Uses standard benchmarks (WebQSP, CWQ) and open models (Llama series, Sentence-BERT). Hyperparameters for the preference loss (alpha, beta, gamma) are defined conceptually but exact values are not in the main text.

📊 Experiments & Results

Evaluation Setup

KGQA on Freebase-grounded datasets

Benchmarks:

WebQuestionsSP (WebQSP): 1-2 hop KGQA
Complex WebQuestions (CWQ): multi-hop KGQA, up to 4 hops

Metrics:

Hit
F1
Statistical methodology: Not explicitly reported in the paper
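
For reference, the two metrics can be computed per question as sketched below, using the common set-based definitions: Hit is 1 when any gold answer appears among the predictions, and F1 is the harmonic mean of set precision and recall. Exact string matching is an assumption here; the paper's answer normalization is not described in this summary.

```python
def hit(predicted, gold):
    """1.0 if at least one gold answer is among the predictions, else 0.0."""
    return 1.0 if set(predicted) & set(gold) else 0.0


def f1(predicted, gold):
    """Set-based F1 between predicted and gold answers."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted, gold = ["Bishop", "Q"], ["Bishop"]
    print(hit(predicted, gold), f1(predicted, gold))
```

Dataset-level scores are then the mean of these per-question values, which is why a model can post a high Hit while its F1 lags when it returns many extra candidates.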

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
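Main comparison on WebQSP shows RPO-RAG achieving SOTA among sub-8B models and outperforming larger baselines. The first two rows compare against GCR (Llama3.1-8B); the third compares RPO-RAG (Llama3.2-1B) against ToG (ChatGPT).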
WebQSP	Hit	87.2	89.9	+2.7
WebQSP	F1	71.1	81.3	+10.2
WebQSP	Hit	76.2	82.3	+6.1
Main comparison on CWQ demonstrates RPO-RAG's effectiveness on complex multi-hop queries.
CWQ	Hit	69.5	70.8	+1.3
CWQ	F1	57.7	62.6	+4.9
Comparison with the vanilla base model (Llama3.2-3B) showing the impact of the RPO-RAG framework.
CWQ	Hit	20.2	66.3	+46.1

Experiment Figures

A motivating example comparing standard 'ungrouped' retrieval results vs. RPO-RAG's 'answer-centered' reasoning paths.

Illustration of the Query-Path Semantic Sampling process using gradient-based dynamic clustering.

Main Takeaways

RPO-RAG consistently outperforms existing KG-based RAG methods and vanilla LLMs across both datasets and various model sizes (1B to 8B).
The framework effectively bridges the performance gap between small LLMs (1B, 3B) and large closed-source models (ChatGPT, GPT-4), especially on WebQSP.
Efficiency analysis shows RPO-RAG offers the best accuracy-latency trade-off compared to SubgraphRAG, GCR, and GNN-RAG.
Ablation studies confirm the contribution of both relation-aware optimization and answer-centered prompting.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph Question Answering (KGQA)
Retrieval-Augmented Generation (RAG)
Direct Preference Optimization (DPO) concepts
Beam search

Key Terms

KGQA: Knowledge Graph Question Answering—answering natural language questions using structured facts from a knowledge graph

RAG: Retrieval-Augmented Generation—enhancing LLMs by retrieving relevant external data before generating an answer

Preference Optimization: Training paradigm (e.g., DPO, or RLHF with PPO) that aligns models with desired behaviors using comparisons between preferred and non-preferred outputs

Relation-aware: Focusing on the specific relationships (edges) in a knowledge graph path, rather than just the entities (nodes)

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for LLMs

PLM: Pretrained Language Model—used here for embedding queries and paths to calculate semantic similarity

Beam search: A search algorithm that expands candidates step by step while keeping only a fixed number (the beam width) of the most promising ones at each step

Hit: Evaluation metric measuring whether the correct answer is present in the set of predicted answers

BFS: Breadth-First Search—a traversal algorithm that explores neighbor nodes first, often used as a baseline path finding heuristic