
Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

Ali Sanaei, Ali Rajabzadeh
University of Chicago, Methods Academy
arXiv (2025)
Agent Factuality Benchmark Reasoning

📝 Paper Summary

LLM-assisted Qualitative Analysis, Research Methodology Frameworks
The paper proposes a 'Depth vs. Autonomy' framework for qualitative research, arguing that high interpretive depth requires low model autonomy, which is achieved through task decomposition and human supervision.
Core Problem
Social science researchers face a tension: LLMs offer high processing power but lack hermeneutic reliability, which often leads to 'interpretive bias' and 'low auditability' when models are granted too much autonomy.
Why it matters:
  • Unsupervised LLMs can hallucinate evidence or force-fit data into categories, compromising validity in high-stakes qualitative research
  • Current adoption lacks a standardized vocabulary to classify and evaluate how much agency is given to models versus humans
  • Naive delegation of deep interpretive tasks to LLMs results in opaque, non-reproducible workflows that obscure the analytical process
Concrete Example: When asked to find evidence of 'bicameralism' in a 7th-century letter (an impossible/anachronistic task), a model without an explicit abstention option fabricated 5-8 pieces of 'evidence'. When given an abstention option ('There is no evidence'), the model correctly returned zero evidence.
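The abstention contrast in this example can be sketched as a prompt builder with an optional opt-out line. This is a minimal illustration, not the authors' prompt: the function names `build_prompt` and `count_evidence`, the prompt wording, and the one-quote-per-line response format are all assumptions made here; only the abstention phrase 'There is no evidence' comes from the summary above.

```python
# Hypothetical sketch: the prompt wording and response format are assumptions,
# not the prompts used in the paper.

ABSTAIN = "There is no evidence"  # abstention phrase quoted in the summary


def build_prompt(text: str, concept: str, allow_abstention: bool) -> str:
    """Build an evidence-finding prompt, optionally offering an explicit opt-out."""
    lines = [
        f"Find passages in the text below that are evidence of '{concept}'.",
        "Quote each passage on its own line.",
    ]
    if allow_abstention:
        # The explicit abstention option is the only difference between conditions.
        lines.append(f"If the text contains no such evidence, reply exactly: {ABSTAIN}")
    lines.extend(["", "TEXT:", text])
    return "\n".join(lines)


def count_evidence(response: str) -> int:
    """Count returned evidence items; an abstention counts as zero."""
    stripped = response.strip()
    if stripped.lower().rstrip(".") == ABSTAIN.lower():
        return 0
    # Assumed format: one quoted passage per non-empty line.
    return sum(1 for line in stripped.splitlines() if line.strip())
```

Running the same text and concept through both variants and comparing `count_evidence` across repeated trials would be one way to reproduce the kind of comparison described, with the model call itself left to whichever client is in use.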
Key Novelty
Bounded Autonomy Framework (Depth × Autonomy Plane)
  • Conceptualizes LLM usage on two axes: 'Interpretive Depth' (surface extraction vs. deep hermeneutics) and 'Autonomy' (assistant vs. trustee)
  • Proposes 'Bounded Autonomy': as the interpretive depth of a research question increases, the realized autonomy of the model must decrease via strict decomposition
  • Introduces 'Vertical and Horizontal Decomposition': splitting tasks sequentially (vertical) or across dimensions (horizontal) to maintain human control over high-depth analysis
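A rough sketch of how the two decomposition modes could be wired up with a human checkpoint after every model step. Everything here is an assumption for illustration: the names `run_vertical`, `run_horizontal`, `Stage`, and `Reviewer` are hypothetical, and each stage is a plain function standing in for an LLM call rather than the authors' actual pipeline.

```python
# Hypothetical sketch of vertical and horizontal decomposition with human review.
# Stage functions stand in for LLM calls; none of these names come from the paper.
from typing import Callable, Dict, List, Tuple

Stage = Callable[[str], str]          # model step: input text -> output text
Reviewer = Callable[[str, str], str]  # human step: (step name, output) -> approved output


def run_vertical(text: str, stages: List[Tuple[str, Stage]], review: Reviewer) -> List[dict]:
    """Run stages in sequence; a human approves each output before it feeds the next."""
    trail = []
    current = text
    for name, stage in stages:
        raw = stage(current)
        approved = review(name, raw)
        # Record every step so the analysis can be audited afterwards.
        trail.append({"stage": name, "input": current,
                      "model_output": raw, "approved": approved})
        current = approved
    return trail


def run_horizontal(text: str, dimensions: Dict[str, Stage], review: Reviewer) -> Dict[str, str]:
    """Analyze the same text independently along each dimension, reviewing each result."""
    return {dim: review(dim, analyze(text)) for dim, analyze in dimensions.items()}
```

The design choice worth noting is that the reviewer sits between stages, so the model never consumes its own unreviewed output; this is one way to realize lower autonomy as interpretive depth increases.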
Evaluation Highlights
  • In a 7th-century text analysis, adding an explicit abstention option reduced the mean count of hallucinated evidence items from 7.36 to 0.16 (near-perfect refusal)
  • Survey of 56 social science papers reveals a weak negative correlation (-0.23) between Interpretive Depth and Reproducibility/Rigor
  • Multi-stage decomposition (vertical + horizontal) yielded the same quantitative scores as single-pass methods but produced richer, auditable evidence trails
Breakthrough Assessment
7/10
Strong methodological contribution establishing a formal vocabulary for LLM social science research. The empirical experiments are simple demonstrations rather than massive benchmarks, but the framework itself is highly practical.