A Survey of Reasoning with Foundation Models

📝 Paper Summary

Foundation Models (LLMs, Multimodal) | Artificial General Intelligence (AGI) | Reasoning Methodologies

This survey systematizes the field of reasoning with foundation models by categorizing tasks, techniques, and benchmarks to bridge the gap between pattern recognition and human-like logical deliberation.

Core Problem

While foundation models excel at pattern recognition (System 1), whether they can perform deliberate, logical, multi-step reasoning (System 2) remains debated, and the research addressing it is fragmented across domains.

Why it matters:

Reasoning is a fundamental requirement for Artificial General Intelligence (AGI), enabling complex problem-solving in negotiation, medical diagnosis, and law.
Current research is scattered across isolated domains (e.g., math vs. commonsense), lacking a unified view of how foundation models can be adapted for general reasoning.
Transitioning from implicit 'System 1' intuition to explicit 'System 2' logical analysis is critical for reliability, interpretability, and robustness in real-world AI applications.

Concrete Example: In a social reasoning task, a model must choose the correct inference about a person (e.g., 'happy' vs. 'likes cold') not by matching keywords, but by logically connecting context ('moved to Florida') with stated preferences ('found Northeast too cold').

Key Novelty

Unified Taxonomy of Reasoning with Foundation Models

Proposes a comprehensive taxonomy classifying reasoning into specific domains (Commonsense, Math, Logical, Causal, Visual, Audio, Multimodal, Embodied).
Integrates diverse foundation model types (Language, Vision, Multimodal) with reasoning-specific techniques like Chain-of-Thought (CoT) and autonomous agents.

Architecture

An overview of the reasoning landscape, mapping 'Reasoning Tasks' to 'Reasoning Techniques' and 'Support'.

Evaluation Highlights

Surveys over 650 papers, categorizing them into tasks, techniques, and benchmarks.
Highlights the performance of specific models such as Minerva (540B parameters), which correctly answers nearly one-third of 200+ undergraduate-level science problems.
Identifies that Chain-of-Thought (CoT) prompting enables zero-shot reasoning on arithmetic benchmarks (GSM8K, SVAMP) without handcrafted examples.

Breakthrough Assessment

9/10

An extensive and timely survey that organizes a rapidly expanding field. It provides a roadmap for researchers by connecting disparate sub-fields (multimodal, logical, agentic) under the umbrella of reasoning.

⚙️ Technical Details

Problem Definition

Setting: General survey and taxonomy construction

Inputs: Literature on Foundation Models (FMs) and Reasoning tasks

Outputs: Structured taxonomy of Reasoning Tasks, Techniques, and Benchmarks

Pipeline Flow

Foundation Model Pre-training (Language, Vision, or Multimodal)
Adaptation (Fine-tuning, Prompt Engineering)
Reasoning Technique Application (CoT, Decomposition)
Task Execution (Commonsense, Math, Logical, etc.)

System Modules

Language Foundation Models (Foundation Models)

Serve as the core reasoning engine for textual tasks

Model or implementation: Examples: GPT-4, PaLM, Llama 2, PanGu-Σ

Vision/Multimodal Foundation Models (Foundation Models)

Process visual and cross-modal information for reasoning

Model or implementation: Examples: ViT, SAM, CLIP, GPT-4V, VideoMAE V2

In-Context Learning / CoT

Elicit reasoning capabilities without weight updates

Model or implementation: Prompting strategies (Zero-shot-CoT, Few-shot CoT)
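
The few-shot CoT strategy named above can be illustrated by how its prompt is assembled. This is a minimal sketch, not the survey's code: the `Demo` class, the function name, and the "Q:/A:" layout are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One worked demonstration: question, reasoning steps, final answer."""
    question: str
    rationale: str
    answer: str


def build_few_shot_cot_prompt(demos: list[Demo], question: str) -> str:
    """Concatenate worked demonstrations, then leave the new answer open."""
    blocks = [
        f"Q: {d.question}\nA: {d.rationale} The answer is {d.answer}."
        for d in demos
    ]
    # The final block has no rationale: the model is expected to continue
    # in the same step-by-step style as the demonstrations.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Because the demonstrations live entirely in the prompt, no model weights change; swapping the demonstrations changes the behavior being elicited.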

Novel Architectural Elements

Integration of 'System 2 Attention' (S2A), which has the model regenerate its input context so that only the relevant content remains before it reasons over it
Hierarchical taxonomies linking formal logic (deductive/inductive/abductive) with neural foundation model techniques
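
The S2A idea listed above can be sketched as two model calls: one to regenerate the context, one to answer from it. This is an illustrative sketch only. The `llm` callable and the instruction wording are hypothetical, not the prompts of any published implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion

REWRITE_INSTRUCTION = (
    "Rewrite the following text so that it keeps only the information "
    "needed to answer the question, leaving out opinions and irrelevant "
    "details.\n\nText: {context}\n\nQuestion: {question}\n\nRewritten text:"
)


def system2_attention(llm: LLM, context: str, question: str) -> str:
    """Two-pass answer: regenerate the context first, then answer from it."""
    # Pass 1: ask the model to regenerate a filtered version of the context.
    filtered = llm(REWRITE_INSTRUCTION.format(context=context, question=question))
    # Pass 2: answer using only the regenerated context, not the original.
    return llm(f"Context: {filtered}\n\nQuestion: {question}\n\nAnswer:")
```

The second call never sees the original context, so distracting or leading material can only affect the answer if it survives the rewrite.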

Modeling

Base Model: Survey covers multiple models: GPT-4, PaLM, Llama 2 (7B-70B), PanGu-α (200B), PanGu-Σ (1T), ViT, SAM, CLIP.

Training Method: Survey discusses various methods: Pre-training (Self-supervised), Fine-tuning (Parameter-Efficient), Alignment (RLHF), and Inference-only techniques (CoT).

Adaptation: LoRA, Prompt Engineering, Linear Probing

Trainable Parameters: Among the language models discussed, ranges from 7B (Llama 2) to 1T (PanGu-Σ).

Training Data:

Varies by model; typically massive web corpora (1.1TB for PanGu-α)
11 million images/1.1 billion masks for SAM

Compute: Not reported in the paper
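
Of the adaptation methods listed above, LoRA is the one most easily stated as a formula. The block below gives the standard low-rank update and a worked parameter count. The layer sizes are illustrative assumptions, not figures from the survey.

```latex
% W_0: frozen pretrained weight; A, B: trainable low-rank factors
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk.
% Illustrative sizes (an assumption): d = k = 4096, r = 8
r(d + k) = 8 \cdot 8192 = 65{,}536
\quad \text{vs.} \quad
dk = 4096^2 = 16{,}777{,}216
\quad (\approx 0.39\%)
```

Only A and B receive gradients, which is why the method counts as parameter-efficient fine-tuning.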

Comparison to Prior Work

vs. Standard Prompting: Emphasizes Chain-of-Thought and multi-step reasoning to emulate 'System 2' thinking
vs. Symbolic AI: Foundation models offer robustness to noise and generalization to unseen scenarios, unlike brittle formal systems
vs. Single-modality Models: Multimodal reasoning integrates visual/audio contexts for more human-like understanding (e.g., 'Caption Anything' framework)

Limitations

Foundation models often struggle with rigorous monotonic reasoning compared to formal logic systems.
Evaluation of complex reasoning tasks remains challenging due to hallucination and lack of ground truth.
The survey relies on reported results from other papers rather than a single unified experimental comparison.
This summary draws only on pages 1-4 of the paper, which do not contain the detailed quantitative results tables.

Reproducibility

Code: https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models

The authors' GitHub repository contains a continuously updated reading list and benchmarks. Some of the models discussed are closed (e.g., GPT-4), while others have openly released weights (Llama 2, SAM).

📊 Experiments & Results

Evaluation Setup

Survey of performance across multiple domains using existing benchmarks.

Benchmarks:

Social IQA (Commonsense Reasoning (Social))
GSM8K (Mathematical Reasoning (Arithmetic))
AQUA-RAT (Mathematical Reasoning)
CLUTRR (Inductive Reasoning (Kinship))
VideoMAE V2 (Visual Reasoning (Action Detection)); a video foundation model rather than a benchmark dataset

Metrics:

Accuracy
Exact Match
Reasoning chain validity
Statistical methodology: Not explicitly reported in the paper
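
For arithmetic benchmarks such as GSM8K, Accuracy and Exact Match are typically computed after extracting a final answer from the model's free-form response. The sketch below shows one simple way to do that. The "take the last number" rule and the function names are assumptions for illustration, not the scoring scripts of any benchmark.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in a response, with thousands separators removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def exact_match(prediction: str, reference: str) -> bool:
    """Exact Match after light normalization (case and surrounding whitespace)."""
    return prediction.strip().lower() == reference.strip().lower()


def accuracy(responses: list[str], references: list[str]) -> float:
    """Fraction of responses whose extracted final number equals the reference."""
    if not references or len(responses) != len(references):
        raise ValueError("need equally sized, non-empty responses and references")
    correct = sum(
        extract_final_number(resp) == ref.replace(",", "")
        for resp, ref in zip(responses, references)
    )
    return correct / len(references)
```

This comparison is string-based, so "18" and "18.0" would count as different answers. A stricter harness would normalize numeric formats before comparing.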

Main Takeaways

Foundation models have moved beyond simple pattern matching to perform complex 'System 2' tasks like mathematical derivation and commonsense inference.
Techniques like Chain-of-Thought (CoT) are pivotal, enabling zero-shot performance on tasks that previously required extensive fine-tuning.
Multimodal reasoning is rapidly advancing, with models like GPT-4V and SAM-based fusion models capable of interpreting and manipulating visual content based on text prompts.
There is a growing trend of 'Model Fusion', where specialized foundation models (e.g., SAM for segmentation + CLIP for recognition) are combined to solve complex tasks.
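
The 'Model Fusion' pattern in the last takeaway amounts to composing a segmenter with a recognizer. The sketch below shows only that composition. `Segmenter`, `Recognizer`, and `fuse` are hypothetical names, and the callables stand in for real models rather than reproducing the SAM or CLIP APIs.

```python
from typing import Callable, Sequence

Region = tuple[int, int, int, int]  # hypothetical box: (x0, y0, x1, y1)
# Hypothetical interfaces: a segmenter proposes regions for an image, and a
# recognizer picks the best-matching label for one region.
Segmenter = Callable[[object], Sequence[Region]]
Recognizer = Callable[[object, Region, Sequence[str]], str]


def fuse(image: object, labels: Sequence[str],
         segment: Segmenter, recognize: Recognizer) -> list[tuple[Region, str]]:
    """Label every region proposed by the segmenter using the recognizer."""
    return [(region, recognize(image, region, labels)) for region in segment(image)]
```

Neither component is retrained here: each model does only the sub-task it was built for, and plain glue code joins the outputs.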

📚 Prerequisite Knowledge

Prerequisites

Understanding of Foundation Models (LLMs, Vision Transformers)
Basic logic concepts (Deduction, Induction, Abduction)
Familiarity with self-supervised learning

Key Terms

System 1 vs System 2: The two modes in the dual-process theory of cognition. System 1 is fast, intuitive, and unconscious, while System 2 is slow, deliberate, and logical.

Chain-of-Thought (CoT): A prompting technique that encourages models to generate intermediate reasoning steps before producing a final answer.

Foundation Models: Large-scale models (e.g., GPT-4, Llama 2) trained on broad data that can be adapted to downstream tasks via fine-tuning or prompting.

In-Context Learning (ICL): The ability of a model to learn from a few examples provided in the prompt without updating its weights.

Mixture of Experts (MoE): A neural network architecture in which different sub-models (experts) specialize in different parts of the input space, and only a few experts are activated for each input.
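
Sparse activation can be sketched with top-k routing over scalar toy experts. This is a simplified illustration under stated assumptions: in a real MoE layer a learned router computes the gate scores from the input, whereas here they are passed in directly, and `moe_forward` is a hypothetical name.

```python
import math
from typing import Callable, Sequence

Expert = Callable[[float], float]  # toy expert: scalar in, scalar out


def softmax(scores: Sequence[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, experts: Sequence[Expert],
                gate_scores: Sequence[float], k: int = 2) -> float:
    """Run only the k highest-scoring experts and mix their outputs."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])  # renormalize over the top-k
    # Experts outside the top-k are never called, which is the sparsity.
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

Compute per input scales with k rather than with the total number of experts, which is how such models grow parameter counts without a matching growth in inference cost.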

Zero-shot-CoT: A method in which a model generates reasoning chains simply by being prompted with 'Let's think step by step', without needing example demonstrations.
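
In practice, Zero-shot-CoT is often run as two calls: one to elicit the reasoning and one to extract the final answer from it. The sketch below assumes that setup. The `llm` callable is hypothetical, and the second call and its trigger wording are assumptions rather than something this summary specifies.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording


def zero_shot_cot(llm: LLM, question: str) -> tuple[str, str]:
    """Return (reasoning, answer) from two calls with no demonstrations."""
    # Stage 1: the trigger phrase elicits intermediate reasoning steps.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = llm(reasoning_prompt)
    # Stage 2: append the reasoning and ask for the final answer only.
    answer = llm(f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}")
    return reasoning, answer.strip()
```

Separating the stages keeps the final answer short and easy to score, while the full reasoning remains available for inspection.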

Abductive Reasoning: Inferring the most plausible explanation or hypothesis for a set of observations (inference to the best explanation).
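
One common way to make "most plausible explanation" concrete is to score each hypothesis by its prior probability times the likelihood of the observations under it. The sketch below assumes that Bayesian framing, which this summary does not prescribe. The function name and the numbers in the usage note are illustrative.

```python
def best_explanation(priors: dict[str, float],
                     likelihoods: dict[str, float]) -> str:
    """Pick the hypothesis maximizing prior * likelihood of the observations."""
    # Hypotheses with no stated likelihood are treated as unable to explain
    # the observations (likelihood 0).
    return max(priors, key=lambda h: priors[h] * likelihoods.get(h, 0.0))
```

For the observation "the grass is wet", priors of 0.3 (rain), 0.1 (sprinkler), and 0.01 (burst pipe) combined with likelihoods of 0.9, 0.8, and 0.95 select rain: the burst pipe explains the observation slightly better, but it is far less probable to begin with.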

Multimodal Reasoning: Reasoning that integrates and processes information from multiple modalities simultaneously, such as text, images, and audio.