Eco-Amazon: Enriching E-commerce Datasets with Product Carbon Footprint for Sustainable Recommendations

📝 Paper Summary

Sustainable Recommender Systems, Green AI, Dataset Enrichment

Eco-Amazon introduces a zero-shot framework using Large Language Models to enrich e-commerce datasets with item-level Product Carbon Footprint metadata, overcoming the scarcity of official environmental data for recommender systems.

Core Problem

Recommender systems research lacks standard benchmarks with item-level environmental impact data (PCF), as official Life Cycle Assessment (LCA) databases are too sparse and expensive to cover massive e-commerce catalogs.

Why it matters:

E-commerce contributes substantially to global emissions, but systems cannot promote sustainable choices without item-level carbon data
Current approaches rely on proprietary databases or manual mapping, which are non-scalable and hinder reproducible research in sustainable AI
The lack of open PCF-enriched resources prevents the community from developing and benchmarking sustainability-aware ranking and recommendation algorithms

Concrete Example: A user searching for 'jeans' sees thousands of results ranked by popularity. Without PCF data, the system cannot distinguish between a high-carbon synthetic pair and low-carbon organic denim. Official databases like Environdec might cover only one specific brand, leaving the vast majority of the catalog unlabelled and effectively invisible to sustainability metrics.

Key Novelty

Zero-shot PCF Estimation via LLMs

Leverages the broad domain knowledge of Large Language Models to infer Product Carbon Footprint from unstructured text descriptions without domain-specific training
Constrains the LLM generation process using prompts based on international standards (GHG Protocol, ISO 14040) to ensure estimates align with Life Cycle Assessment principles rather than mere statistical guessing

Architecture

Conceptual flow of the zero-shot prompting strategy used to estimate PCF

Evaluation Highlights

Spearman rank correlation > 0.90 for both GPT-o3-mini and Gemini-2.5-flash across Electronics, Clothing, and Home & Kitchen domains, indicating high ordinal reliability
Low-impact products (the target for sustainable recommendations) are estimated with high precision, maintaining a Mean Absolute Error (MAE) below 6 kg CO2e across all domains
Enriched a total of 49,902 items across three Amazon datasets, creating the largest publicly available resource for PCF-aware recommendation research

Breakthrough Assessment

8/10

Addresses a critical data gap in sustainable AI by providing the first large-scale, open PCF-enriched e-commerce dataset. The zero-shot methodology is highly scalable, though absolute precision on high-impact items remains a challenge.

⚙️ Technical Details

Problem Definition

Setting: Estimate the cradle-to-grave Product Carbon Footprint (PCF) for e-commerce items based on textual metadata

Inputs: Product metadata $d_i$ (title, description, features)

Outputs: Estimated PCF value (kg CO2e)

Pipeline Flow

Data Filtering (k-core) & Sampling
Official Data Retrieval (Step 1)
LLM Inference (Step 2)
Dataset Integration

System Modules

Data Preprocessor

Filter the raw Amazon review data with 15-core filtering (keeping only users and items with at least 15 interactions), then apply random sampling to create manageable subsets

Model or implementation: Script-based filtering
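As a rough illustration of the preprocessing step, the following minimal sketch shows one common way to implement k-core filtering followed by item sampling. It is not the authors' script: the function names, the iterative (repeat-until-stable) variant, and the seeded item-level sampling are assumptions made for this example.

```python
import random
from collections import Counter


def k_core_filter(interactions, k):
    """Iteratively drop users and items with fewer than k interactions.

    interactions: iterable of (user_id, item_id) pairs.
    Returns the surviving pairs as a list, in their original order.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in current)
        item_counts = Counter(i for _, i in current)
        kept = [
            (u, i)
            for u, i in current
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        # Removing one pair can push other users/items below k,
        # so repeat until nothing changes.
        if len(kept) == len(current):
            return kept
        current = kept


def sample_items(interactions, n_items, seed=0):
    """Keep only interactions for a random subset of n_items items (hypothetical helper)."""
    items = sorted({i for _, i in interactions})
    if n_items >= len(items):
        chosen = set(items)
    else:
        chosen = set(random.Random(seed).sample(items, n_items))
    return [(u, i) for u, i in interactions if i in chosen]
```

Calling `k_core_filter(pairs, 15)` would correspond to the 15-core setting mentioned above. Whether the original pipeline samples items, users, or interactions is not stated in this summary.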

Official Data Checker (Estimation Engine)

Identify if official carbon footprint data exists (EPDs or manufacturer reports) for an item

Model or implementation: LLM (GPT-o3-mini or Gemini-2.5-flash)

PCF Estimator (Estimation Engine)

Synthesize PCF estimation from unstructured metadata using zero-shot reasoning constrained by LCA standards

Model or implementation: LLM (GPT-o3-mini or Gemini-2.5-flash) with zero-shot prompting
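To make the estimator concrete, here is a minimal sketch of how a standards-constrained zero-shot prompt could be assembled from item metadata and how a numeric answer could be parsed from the reply. The prompt wording, the `PCF: <number> kg CO2e` answer format, and the helper names are assumptions for illustration. The paper's actual prompt is not reproduced in this summary.

```python
import re

# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "You are an LCA analyst. Following ISO 14040 and the GHG Protocol, "
    "estimate the cradle-to-grave Product Carbon Footprint of the item below.\n"
    "Title: {title}\n"
    "Description: {description}\n"
    "Features: {features}\n"
    "Answer with a single line of the form 'PCF: <number> kg CO2e'."
)

# Matches e.g. "PCF: 33.4 kg CO2e" (case-insensitive).
_PCF_PATTERN = re.compile(
    r"PCF:\s*([0-9]+(?:\.[0-9]+)?)\s*kg\s*CO2e", re.IGNORECASE
)


def build_prompt(title, description="", features=()):
    """Fill the template with the item's textual metadata."""
    return PROMPT_TEMPLATE.format(
        title=title.strip(),
        description=description.strip() or "n/a",
        features="; ".join(f.strip() for f in features) or "n/a",
    )


def parse_pcf(reply):
    """Return the estimate in kg CO2e, or None if the reply has no parsable value."""
    match = _PCF_PATTERN.search(reply)
    return float(match.group(1)) if match else None
```

The prompt string would then be sent to whichever model is being used. Returning `None` on an unparsable reply lets the caller decide whether to retry or skip the item.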

Novel Architectural Elements

Two-step prompting strategy that explicitly enforces ISO 14040 and GHG Protocol compliance within the prompt context, grounding LLM outputs in established accounting principles and limiting hallucinated estimates
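The two-step flow can be sketched as a small orchestration function. This is an illustrative sketch only: the two callables stand in for the LLM-backed modules, and the assumption that Step 2 runs only when Step 1 finds no official value is ours, not something this summary confirms.

```python
from typing import Callable, Optional


def estimate_pcf(
    metadata: str,
    lookup_official: Callable[[str], Optional[float]],
    estimate_zero_shot: Callable[[str], float],
) -> dict:
    """Two-step PCF estimation with provenance tracking.

    lookup_official: stand-in for Step 1; returns an official PCF value
        in kg CO2e, or None when no EPD or manufacturer report is found.
    estimate_zero_shot: stand-in for Step 2; returns an LLM estimate.
    """
    official = lookup_official(metadata)
    if official is not None:
        return {"pcf_kg_co2e": official, "source": "official"}
    # Assumed fallback: estimate only when no official figure exists.
    return {"pcf_kg_co2e": estimate_zero_shot(metadata), "source": "llm_estimate"}
```

Recording the `source` field alongside the value keeps officially documented figures distinguishable from model estimates when the enriched dataset is used downstream.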

Modeling

Base Model: GPT-o3-mini and Gemini-2.5-flash

Comparison to Prior Work

vs. AutoPCF: Eco-Amazon generalizes to broad e-commerce categories (Clothing, Home) rather than narrow industrial materials
vs. PCF-RWKV: Does not require training/fine-tuning; relies on zero-shot inference with standard constraints
vs. Vicenti et al. (2026): Expands scope to multi-domain (Clothing, Home & Kitchen) and benchmarks multiple LLMs (GPT vs Gemini)
vs. Climatiq API [not cited in paper]: Provides open item-level estimation for free vs. paid proprietary API access

Limitations

Absolute PCF estimates for high-impact items (outliers) show large errors (e.g., an MAE of 284.97 kg CO2e for high-impact Home & Kitchen items)
Reliance on LLM latent knowledge may introduce biases or hallucinations where product descriptions are vague
Ground truth validation is limited to 159 items due to the scarcity of official EPDs
Current approach does not use retrieval-augmented generation (RAG) to access external emission factor databases

Reproducibility

Code: http://github.com/giuspillo/EcoAmazon/

The code is publicly available at the repository above. The enriched datasets and the source code of the estimation script are released, together with the ground-truth list of 159 items with official EPDs for benchmarking.

📊 Experiments & Results

Evaluation Setup

Validation against ground-truth PCF values from official Environmental Product Declarations (EPDs)

Benchmarks:

Ground Truth Dataset (PCF Estimation Accuracy & Ranking) [New]

Metrics:

Mean Absolute Error (MAE)
Spearman rank correlation coefficient
Normalized Discounted Cumulative Gain (NDCG)
Statistical methodology: Not explicitly reported in the paper

Key Results

Ranking performance metrics demonstrate that LLMs can reliably order products by environmental impact, even if absolute values fluctuate.
Absolute error analysis shows high precision for low-carbon items, which are most relevant for sustainable recommendations, but high error for carbon-intensive outliers.

Benchmark	Metric	Baseline	This Paper	Δ
Ground Truth Dataset	Spearman Rank Coefficient	Not reported in the paper	> 0.9	Not reported in the paper
Low-Impact Items (Ground Truth)	MAE (kg CO2e)	Not applicable	< 6.0	Not applicable
High-Impact Items (Home & Kitchen)	MAE (kg CO2e)	Not applicable	284.97	Not applicable

Experiment Figures

Breakdown of Mean Absolute Error (MAE) for GPT-o3-mini across three impact categories (Low, Medium, High)
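A per-category MAE breakdown of this kind could be computed as in the sketch below. The tier thresholds are hypothetical parameters, since the paper's cut-offs between Low, Medium, and High are not given in this summary, and assigning tiers by the ground-truth value is likewise an assumption.

```python
def mae_by_tier(y_true, y_pred, low_max, medium_max):
    """Mean absolute error per impact tier.

    Items are tiered by their ground-truth PCF (kg CO2e):
    low if <= low_max, medium if <= medium_max, otherwise high.
    Tiers with no items map to None.
    """
    buckets = {"low": [], "medium": [], "high": []}
    for t, p in zip(y_true, y_pred):
        if t <= low_max:
            tier = "low"
        elif t <= medium_max:
            tier = "medium"
        else:
            tier = "high"
        buckets[tier].append(abs(t - p))
    return {
        tier: (sum(errs) / len(errs) if errs else None)
        for tier, errs in buckets.items()
    }
```

Reporting the error per tier rather than as a single global figure keeps a few carbon-intensive outliers from masking how well low-impact items are estimated.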

Main Takeaways

LLMs demonstrate strong ordinal consistency (Spearman > 0.9), making them highly effective for ranking-based tasks like recommendation even when absolute PCF values are imprecise.
Global error metrics are disproportionately driven by high-impact outliers; the models are much more reliable for low-impact 'green' alternatives.
GPT-o3-mini generally outperforms Gemini-2.5-flash in absolute accuracy (lower MAE), though both are effective for ranking.
The approach successfully scales to ~50k items, enabling the creation of the first multi-domain sustainable recommendation benchmark.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (RS) and Information Retrieval (IR)
Understanding of Zero-shot prompting with LLMs
Familiarity with Life Cycle Assessment (LCA) concepts

Key Terms

PCF: Product Carbon Footprint—the total greenhouse gas emissions generated by a product over its life cycle, expressed as CO2 equivalent

LCA: Life Cycle Assessment—a methodology for assessing environmental impacts associated with all the stages of the life-cycle of a commercial product

GHG Protocol: Greenhouse Gas Protocol—a global standardized framework to measure and manage greenhouse gas emissions

Zero-shot: A machine learning setting where a model performs a task without having seen any specific training examples for that task

EPD: Environmental Product Declaration—a standardized document informing about a product's environmental performance

CO2e: Carbon Dioxide Equivalent—a metric measure used to compare the emissions from various greenhouse gases on the basis of their global-warming potential

MAE: Mean Absolute Error—measure of errors between paired observations expressing the same phenomenon

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality

k-core filtering: A data preprocessing step that keeps only users and items with at least k interactions to reduce sparsity