University of Illinois Urbana-Champaign,
The Hong Kong University of Science and Technology
arXiv, 8/2025
Reasoning, RAG, QA, Agent
📝 Paper Summary
Tabular Reasoning, Table QA, Fact Verification
LRTab improves tabular reasoning by learning 'Prompt Conditions' from incorrect Chain-of-Thought predictions on training data, then retrieving these conditions at inference time to guide the LLM.
Core Problem
Current tabular reasoning approaches either fine-tune LLMs (costly, less generalizable) or use training-free prompting (highly generalizable but fails to utilize insights from labeled training data).
Why it matters:
Tabular data is ubiquitous in business and consumer applications but remains challenging due to inconsistent formatting and complex column relationships
Incorrect reasoning examples in training data reveal key knowledge gaps, yet current prompting methods discard them rather than learning from the mistakes
Fine-tuning requires task-specific data and lacks flexibility, while standard prompting misses the opportunity to 'learn' from the provided ground truth labels
Concrete Example: Initial attempts to correct LLMs using ground truth result in leakage (e.g., 'Given the answer is X, I should...'), which is unusable at test time. Standard prompting ignores these error cases entirely.
Key Novelty
Learn then Retrieve (LRTab)
Treats training data not just as few-shot examples, but as a source of error correction: generates 'Prompt Conditions' (guidelines) specifically to fix incorrect Chain-of-Thought (CoT) reasoning
Validates these conditions against ground truth to ensure they actually fix the error, creating a high-quality pool of interpretable hints
At inference, retrieves the most relevant Prompt Conditions (via similarity and reranking) to preemptively guide the LLM away from likely reasoning pitfalls
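The learn-then-retrieve loop above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the paper's implementation: `llm` and `propose_condition` are hypothetical stand-ins for real GPT-4o-mini calls, stubbed here so the control flow runs end to end.

```python
# Minimal sketch of LRTab's learning phase. `llm` and `propose_condition`
# are hypothetical stubs standing in for real LLM calls.

def llm(prompt: str) -> str:
    """Toy stand-in: answers correctly only when the summing hint is present."""
    if "Sum the column" in prompt:
        return "42"
    return "41"  # simulated incorrect chain-of-thought answer

def propose_condition(table: str, query: str, wrong_answer: str) -> str:
    """In LRTab the LLM drafts a corrective guideline; stubbed here."""
    return "Sum the column before answering."

def learn_conditions(training_data):
    """Mine validated Prompt Conditions from incorrect CoT predictions."""
    pool = []
    for table, query, gold in training_data:
        pred = llm(f"{table}\n{query}")
        if pred == gold:
            continue  # correct CoTs need no corrective condition
        condition = propose_condition(table, query, pred)
        # Validate: keep the condition only if re-prompting with it
        # actually fixes the error, checked against the ground truth.
        retry = llm(f"{table}\n{query}\nCondition: {condition}")
        if retry == gold:
            pool.append((table, query, condition))
    return pool

pool = learn_conditions([("sales table", "What is the total?", "42")])
print(pool)
```

The validation step is the key difference from naive self-correction: a condition enters the pool only after it has demonstrably flipped a wrong answer to the correct one.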
Architecture
The LRTab inference workflow, showing how Prompt Conditions are retrieved and added to the context.
Evaluation Highlights
Achieves 76.8% accuracy on WikiTQ with GPT-4o-mini, outperforming the previous best H-STAR (with same base model) by a significant margin
Attains 89.74% on TabFact with GPT-4o-mini, surpassing Mixed Self-Consistency and Chain-of-Table baselines
Flexible prompting (letting the model choose whether to code) improves accuracy by up to 3 points over direct prompting or forced coding
Breakthrough Assessment
7/10
Effective hybrid between fine-tuning and prompting. Successfully exploits training data for inference-only models without weight updates, achieving SOTA on standard benchmarks.
⚙️ Technical Details
Problem Definition
Setting: Table-based reasoning where input is a table T and query Q, and objective is to predict answer A_pred
Embedding Retriever (Retrieval)
Retrieve relevant Prompt Conditions based on table/query similarity
Model or implementation: Salesforce SFR-Embedding-Code-400M_R
Cross-Encoder Reranker (Retrieval)
Select the most useful Prompt Conditions from candidates
Model or implementation: nli-deberta-v3-large (fine-tuned on validation data)
LLM Agent
Generate reasoning steps (CoT) and final answer, optionally utilizing Python code
Model or implementation: GPT-4o or GPT-4o-mini
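The two-stage retrieval described by these components can be sketched as follows. This is a toy approximation: bag-of-words counts stand in for SFR-Embedding-Code-400M_R embeddings, and `cross_encoder_score` is a placeholder for the fine-tuned nli-deberta-v3-large reranker; all function names are illustrative.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; LRTab uses SFR-Embedding-Code-400M_R."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_encoder_score(query: str, condition: str) -> float:
    """Placeholder for the fine-tuned nli-deberta-v3-large reranker."""
    return cosine(embed(query), embed(condition))

def retrieve_conditions(query: str, pool: list[str], k: int = 2, top_n: int = 1):
    # Stage 1: embedding similarity narrows the pool to k candidates.
    candidates = sorted(pool, key=lambda c: cosine(embed(query), embed(c)),
                        reverse=True)[:k]
    # Stage 2: the cross-encoder reranks candidates; keep the top_n.
    return sorted(candidates, key=lambda c: cross_encoder_score(query, c),
                  reverse=True)[:top_n]

pool = [
    "Sum the column before answering questions about totals.",
    "Do not process datetimes with Python.",
    "Check for merged header rows.",
]
print(retrieve_conditions("What is the total of the sales column?", pool))
```

In the real pipeline the cheap bi-encoder stage keeps retrieval scalable over the full condition pool, while the costlier cross-encoder sees query and condition jointly and so ranks the shortlist more precisely.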
Novel Architectural Elements
Pipeline explicitly separates 'learning' (generating conditions from errors) from 'inference' (retrieving conditions)
Integration of error-correcting 'Prompt Conditions' into the ICL context
Modeling
Base Model: GPT-4o and GPT-4o-mini
Training Method: In-context learning with retrieved artifacts (Prompt Conditions) mined from training data. Note: The Cross-Encoder is fine-tuned.
Training Data:
Uses ~3000 samples from training datasets (WikiTQ, TabFact) to mine conditions
Validation data used to train the cross-encoder reranker (label 1 if condition helped, 0 otherwise)
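The reranker's labeling scheme can be sketched directly from that description: each (example, condition) pair is labeled 1 if adding the condition yields the correct answer, else 0. A hedged sketch with a stub `llm`; the function names are illustrative, not from the paper's code.

```python
def make_reranker_pairs(validation_data, conditions, llm):
    """Build ((query context, condition), label) pairs for cross-encoder training.

    Label is 1 when prompting with the condition produces the correct
    answer on a validation example, 0 otherwise.
    """
    pairs = []
    for table, query, gold in validation_data:
        for cond in conditions:
            pred = llm(f"{table}\n{query}\nCondition: {cond}")
            label = 1 if pred == gold else 0
            pairs.append(((f"{table} {query}", cond), label))
    return pairs

# Toy LLM stub: only the summing hint leads to the right answer.
stub = lambda prompt: "42" if "Sum the column" in prompt else "41"
pairs = make_reranker_pairs(
    [("sales table", "What is the total?", "42")],
    ["Sum the column before answering.", "Do not process datetimes with Python."],
    stub,
)
print(pairs)
```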
Comparison to Prior Work
vs. Reinforced ICL: LRTab explicitly mines 'Prompt Conditions' from *incorrect* CoTs to prevent errors, rather than just retrieving correct examples
vs. Fine-tuning (UnifiedSKG): LRTab is training-free for the LLM itself, offering interpretability and lower cost
vs. Self-Correction (Huang et al.): LRTab uses ground truth during the 'learning' phase to guarantee correction validity, whereas self-correction without ground truth often fails
Ablation Findings
Retrieval based on semantic similarity significantly outperforms random retrieval
Performance scales positively with the number of retrieved Prompt Conditions, though context length is a limiting factor
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
In-Context Learning (ICL)
Retrieval-Augmented Generation (RAG)
Pandas/Python for tabular data processing
Key Terms
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Prompt Conditions: Short, natural language guidelines generated by the LLM to correct specific reasoning errors found during the training phase (e.g., 'Do not process datetimes with Python')
WikiTQ: WikiTableQuestions—a large-scale dataset for question answering on semi-structured Wikipedia tables
TabFact: A benchmark for binary fact verification (True/False) based on tabular data
Cross-encoder: A transformer model that takes a pair of sentences (or table/query pairs) as input and outputs a similarity or relevance score, used here for re-ranking
Reranker: A second-stage retrieval model that re-orders the top-k candidates from a simpler retriever to improve precision