Interpretable machine learning framework for air quality prediction in Istanbul using Shapley additive explanations (SHAP)

Research output: Contribution to journal › Article › peer-review

Abstract

This study develops a season-aware machine-learning (ML) framework to predict hourly concentrations of PM10, PM2.5 and O3 across İstanbul. A comprehensive 2021–2023 dataset was compiled from three co-located air-quality and meteorological monitoring stations that typify contrasting source regimes, i.e., a traffic-dominated urban site, a rural background site, and a semi-urban coastal site. Seven ML algorithms, namely eXtreme Gradient Boosting (XGBoost), Extra Trees (ETR), Random Forest (RF), Adaptive Boosting (AdaBoost), Multi-Layer Perceptron (MLP), k-Nearest Neighbors (KNN) and Support Vector Regression (SVR), were utilized to establish a holistic comparison scheme. Hyperparameters were optimized using five-fold cross-validated Bayesian search, and models were evaluated with various performance indicators on season-withheld test sets. In the winter months, ETR achieved a mean R² = 0.93 (RMSE ≈ 10 µg/m³) for PM10 at Bağcılar, while XGBoost yielded R² = 0.88 for O3 at the same site. Summer predictions were more challenging. PM10 skill in rural Arnavutköy dropped to R² = 0.61 despite strong training fits, highlighting over-fitting risks under complex, non-stationary chemical conditions. By contrast, MLP maintained robust urban performance for PM2.5 (summer test R² = 0.80) and KNN provided the most stable O3 prediction in rural areas (R² = 0.74). To enhance interpretability, SHAP (SHapley Additive exPlanations) analysis was applied to the best-performing models, enabling a transparent assessment of how meteorological and co-pollutant inputs shaped predictions at each site. The proposed framework demonstrates that data-driven models can complement traditional air-quality modeling systems by providing station-level insights and interpretable relationships between pollutants and meteorological drivers, supporting air-quality assessment and policy-relevant analyses in rapidly urbanizing regions.
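The SHAP step mentioned in the abstract attributes each prediction to its input features through Shapley values. The following is a minimal sketch of that idea, not the authors' pipeline and not the `shap` library: it computes exact Shapley values by enumerating feature coalitions for a hypothetical three-feature toy model. The model `toy_pm10_model`, its coefficients, the feature choice (wind speed, temperature, NO2) and all numbers are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for a single sample x.

    Features outside a coalition take their values from each row of
    `background`, and predictions are averaged over those rows.
    Returns (per-feature attributions, baseline prediction).
    """
    n = len(x)

    def value(coalition):
        # Average prediction when only the coalition's features come from x.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi, value(frozenset())


def toy_pm10_model(features):
    """Hypothetical stand-in for a fitted regressor (made-up coefficients)."""
    wind, temp, no2 = features
    return 40.0 - 3.0 * wind + 0.2 * temp + 0.5 * no2 - 0.05 * wind * no2


# Made-up background rows and one made-up hour to explain:
# [wind speed, temperature, NO2]
background = [[2.0, 5.0, 30.0], [4.0, 10.0, 20.0], [6.0, 15.0, 40.0]]
sample = [1.0, 3.0, 60.0]

phi, baseline = shapley_values(toy_pm10_model, sample, background)
prediction = toy_pm10_model(sample)

for name, contribution in zip(["wind", "temp", "no2"], phi):
    print(f"{name}: {contribution:+.3f}")
print(f"baseline + sum(phi) = {baseline + sum(phi):.3f}, prediction = {prediction:.3f}")
```

The attributions sum to the difference between the prediction and the baseline, which is the additivity property that makes per-feature contributions comparable across sites. Exact enumeration scales exponentially with the number of features, so practical tools approximate it or exploit model structure.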

Original language: English
Article number: 37
Journal: Stochastic Environmental Research and Risk Assessment
Volume: 40
Issue number: 2
Publication status: Published - Feb 2026

Bibliographical note

Publisher Copyright:
© The Author(s) 2026.

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 11 - Sustainable Cities and Communities
  2. SDG 14 - Life Below Water

Keywords

  • Machine learning
  • O3
  • PM10
  • PM2.5
  • SHAP
  • İstanbul
