Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

📝 Paper Summary

Fairness in LLM Recommendations, Educational Recommender Systems

The paper reveals severe Western-centric and socioeconomic biases in LLM-based university recommendations and proposes a multi-dimensional framework to quantify demographic fit and geographic diversity.

Core Problem

LLMs used for educational guidance often perpetuate societal biases, recommending institutions that ignore a student's geographic, economic, or cultural context.

Why it matters:

University choice profoundly shapes career trajectories and socioeconomic mobility; biased advice can entrench global inequalities
Educational technology firms are deploying AI chatbots for high-stakes admissions guidance without transparency into their fairness
A 'rich-get-richer' effect occurs when models systematically under-represent institutions in the Global South

Concrete Example: When users from developing countries ask for university recommendations, LLMs repeatedly steer them toward elite Western institutions (U.S./U.K.) regardless of their economic status, effectively ignoring local or regionally accessible high-quality options.

Key Novelty

Dual-Lens Fairness Evaluation Framework (DRS & GRS)

Demographic Representation Score (DRS): Measures how well a recommendation fits a specific student profile by modeling 'socio-economic distance' (decay of opportunity over distance) and alignment with academic interests.
Geographic Representation Score (GRS): Evaluates the diversity of the recommendation set itself, penalizing models that only suggest universities from countries with large education sectors (like the U.S.) by normalizing against the country's actual academic size.
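
The normalization idea behind GRS can be illustrated with a small sketch. This is not the paper's formula: the function name `representation_ratio`, the ratio-of-shares form, and the toy counts are all assumptions made for illustration. The sketch compares each country's share of the recommendations with its share of the world's universities, so a ratio above 1 indicates over-representation relative to academic size.

```python
from collections import Counter

def representation_ratio(recommended_countries, universities_per_country):
    """Per country: share of recommendations divided by share of universities.

    Hypothetical helper; the paper's actual GRS definition may differ.
    """
    rec_counts = Counter(recommended_countries)
    total_recs = sum(rec_counts.values())
    total_unis = sum(universities_per_country.values())
    ratios = {}
    for country, n_unis in universities_per_country.items():
        rec_share = rec_counts.get(country, 0) / total_recs
        size_share = n_unis / total_unis
        ratios[country] = rec_share / size_share
    return ratios

# Toy numbers, invented for illustration only.
recs = ["US"] * 6 + ["UK"] * 2 + ["IN"] * 2
sizes = {"US": 40, "UK": 10, "IN": 50}
ratios = representation_ratio(recs, sizes)
for country, ratio in ratios.items():
    print(f"{country}: {ratio:.2f}")
```

In this toy setting a country that hosts half of all universities but receives a fifth of the recommendations gets a ratio below 1, which is the kind of under-representation a size-normalized score is meant to expose.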

Architecture

The proposed Evaluation Framework structure, detailing the calculation of DRS and GRS metrics

Evaluation Highlights

52–80% of all recommendations from LLaMA-3.1, Gemma-7B, and Mistral-7B favor institutions in the U.S. and U.K., showing strong Western-centric bias.
LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, yet systemic disparities persist.
Strong gender stereotyping is observed: female profiles are steered toward social sciences, male profiles toward engineering, and transgender profiles disproportionately toward gender studies.

Breakthrough Assessment

7/10

Strong empirical audit of a high-stakes domain (education) with a novel, theoretically grounded evaluation framework. While it doesn't propose a new model architecture, the metrics provide a necessary benchmark for fairness.

⚙️ Technical Details

Problem Definition

Setting: Open-ended recommendation of universities and academic programs based on simulated user profiles

Inputs: Natural language query representing a student profile (nationality, gender, economic status)

Outputs: List of recommended universities and programs

Pipeline Flow

Profile Generation (360 simulated profiles)
Query Formulation (Prompting LLMs)
Response Parsing & Mapping
Evaluation Calculation (DRS & GRS)
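
The parsing step in the flow above could look roughly like the following sketch. The response format (a numbered list with a ` - ` or `:` separator), the regular expression, and the function name `parse_recommendations` are assumptions; the summary does not describe how the authors actually parse model output, and the sample text is invented.

```python
import re

# Matches lines such as "1. Name - Program" or "2) Name: Program" (assumed format).
LINE_RE = re.compile(
    r"^\s*\d+[.)]\s*(?P<name>.+?)(?:\s+-\s+|:\s*)(?P<program>.+)$"
)

def parse_recommendations(response_text):
    """Extract (university, program) pairs from a numbered-list response."""
    pairs = []
    for line in response_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            pairs.append(
                (match.group("name").strip(), match.group("program").strip())
            )
    return pairs

# Invented sample response, for illustration only.
sample = """Here are some options:
1. University of Cape Town - Computer Science
2) University of Nairobi: Civil Engineering
"""
parsed = parse_recommendations(sample)
print(parsed)
```

A real pipeline would also need to map each extracted name to a canonical institution record (for example, to look up its country and ranking), which this sketch leaves out.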

System Modules

Profile Generator

Create synthetic user profiles varying by Nationality (40), Economic Class (3), and Gender (3)

Model or implementation: Rule-based permutation
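
A rule-based permutation of this kind can be sketched with a Cartesian product. The placeholder nationality labels and the dictionary keys are assumptions; only the dimension sizes (40 × 3 × 3) and the category names for economic class and gender come from the summary.

```python
from itertools import product

# Placeholder labels; the study uses 40 real nationalities.
NATIONALITIES = [f"nationality_{i:02d}" for i in range(1, 41)]
ECONOMIC_CLASSES = ["low", "middle", "high"]
GENDERS = ["female", "male", "transgender"]

# One profile per combination of the three attributes.
profiles = [
    {"nationality": n, "economic_class": e, "gender": g}
    for n, e, g in product(NATIONALITIES, ECONOMIC_CLASSES, GENDERS)
]

print(len(profiles))
```

The full product yields 40 × 3 × 3 = 360 profiles, matching the number of simulated profiles reported.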

Recommendation Engine

Generate university recommendations based on profile queries

Model or implementation: LLaMA-3.1-8B-Instruct / Gemma-7B-Instruct / Mistral-7B-Instruct-v0.2

Evaluator

Compute DRS and GRS metrics for the generated recommendations

Model or implementation: Python-based Evaluation Framework

Novel Architectural Elements

Novel evaluation framework architecture integrating geographic data (distance) and external knowledge bases (QS rankings) to audit LLM outputs

Modeling

Base Model: Evaluated three models: LLaMA-3.1-8B-Instruct, Gemma-7B-Instruct, Mistral-7B-Instruct-v0.2

Compute: Not reported in the paper

Comparison to Prior Work

vs. Zhang et al.: Focuses on educational domain rather than employment; introduces geographic distance as a key fairness metric
vs. Dudy et al.: Examines the intersection of geography with economic status and gender, rather than location alone
vs. Decoupes et al. [not cited in paper]: Decoupes analyzes semantic vs geographic distance in LMs; this paper creates a specific user-centric utility metric (DRS) rather than just measuring model knowledge
Novel contribution: First framework to explicitly model 'educational accessibility' via geodesic distance decay in LLM evaluation

Limitations

Relies on geodesic distance as a proxy for accessibility, which ignores visa regimes, travel costs, and cultural factors
University reputation is based solely on QS World University Rankings, which may have its own biases
Analysis is limited to English-language queries and models
Proprietary models such as GPT-4 were not evaluated; the study restricts itself to open-source models for cost and reproducibility reasons

Reproducibility

Code: https://github.com/cerai-iitm/Academic-Recommendation-Framework

📊 Experiments & Results

Evaluation Setup

Simulated user study with 360 profiles asking for university recommendations

Benchmarks:

Simulated Academic Queries (Open-ended Recommendation) [New]

Metrics:

Demographic Representation Score (DRS)
Geographic Representation Score (GRS)
Socio-Economic Accessibility (Acc)
Reputation Alignment (Rep)
Statistical methodology: Not explicitly reported in the paper

Key Results

Analysis of geographic bias shows a strong preference for Western institutions across all models.

Benchmark	Metric	Baseline	This Paper	Δ
Simulated Academic Queries	Share of recommendations in U.S./U.K. (across the three models)	Not reported	52–80%	Not reported
Simulated Academic Queries	Unique universities recommended (LLaMA-3.1)	Not reported	481	Not reported
Simulated Academic Queries	Unique countries covered (LLaMA-3.1)	Not reported	58	Not reported
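
Summary statistics like those in the table can be computed from a list of parsed recommendations. The following is a minimal sketch with invented data; the function name `summarize`, the tuple layout, and the country codes are assumptions rather than the authors' code.

```python
def summarize(recommendations, western=("US", "UK")):
    """recommendations: list of (university, country_code) tuples."""
    total = len(recommendations)
    western_hits = sum(1 for _, country in recommendations if country in western)
    return {
        "western_share_pct": 100.0 * western_hits / total,
        "unique_universities": len({u for u, _ in recommendations}),
        "unique_countries": len({c for _, c in recommendations}),
    }

# Toy data, invented for illustration only.
toy = [("Univ A", "US"), ("Univ A", "US"), ("Univ B", "UK"), ("Univ C", "IN")]
stats = summarize(toy)
print(stats)
```

Repeated recommendations count toward the Western share but only once toward the unique-university total, which is why a model can recommend many distinct institutions and still show a heavy U.S./U.K. skew.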

Experiment Figures

World map visualization of recommended universities vs. user locations

Main Takeaways

Models exhibit 'Prestige-Seeking' behavior: they prioritize high-ranking Western universities even for students with low economic status or high geographic barriers.
Gender stereotyping is rampant: Transgender users are disproportionately recommended Gender Studies and Social Work, while males get STEM recommendations.
Economic status inputs (Low/Middle/High) correlate with institutional prestige recommendations, potentially reinforcing socioeconomic stratification.
LLaMA-3.1 shows better geographic diversity than Gemma or Mistral but still suffers from significant systemic bias.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs)
Familiarity with recommender system metrics (precision, recall, fairness)
Understanding of bias in AI (allocative vs. representational harm)

Key Terms

DRS: Demographic Representation Score—measures how well recommendations fit a student's specific background (socio-economic, academic, reputation)

GRS: Geographic Representation Score—evaluates the global diversity and quality of the set of recommended universities

Distance-decay principle: A geographic concept used here to model how educational opportunity decreases as the physical/economic distance from the student increases
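
One common way to express distance decay is an exponential falloff. The sketch below assumes that functional form and an arbitrary scale of 2,000 km; neither is taken from the paper, and the function name `accessibility` is hypothetical.

```python
import math

def accessibility(distance_km, scale_km=2000.0):
    """Exponential distance decay: 1.0 at zero distance, approaching 0 far away.

    Both the exponential form and the default scale are illustrative assumptions.
    """
    return math.exp(-distance_km / scale_km)

for d in (0, 500, 2000, 10000):
    print(d, round(accessibility(d), 3))
```

Under this form, a university one scale length away scores about 0.37 (that is, 1/e), and the score keeps shrinking smoothly with distance instead of cutting off at a hard threshold.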

Vincenty’s formula: A method to calculate the geodesic distance between two points on the surface of a spheroid (Earth)
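
For intuition, the sketch below computes great-circle distance with the haversine formula. This is a simpler spherical approximation and not Vincenty's formula, which works on an ellipsoid and is more accurate; the function name and the choice of mean Earth radius are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```

The spherical result typically differs from an ellipsoidal geodesic by a fraction of a percent, which matters little when distance serves only as a coarse accessibility proxy.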

Jaccard index: A statistic used for gauging the similarity and diversity of sample sets; used here to measure overlap between student interests and university programs
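
The Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the interest and program labels are invented, and returning 0.0 for two empty sets is a convention chosen here, not something stated in the paper.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two iterables of hashable items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Invented example: student interests vs. programs offered by a university.
student_interests = {"ai", "robotics", "math"}
university_programs = {"ai", "math", "physics"}
overlap = jaccard(student_interests, university_programs)
print(overlap)
```

Here two of the four distinct labels are shared, so the index is 0.5; identical sets score 1.0 and disjoint sets score 0.0.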

Geodesic distance: The shortest path between two points on a curved surface (the Earth), used here as a proxy for socio-economic accessibility barriers