Investigating Tabular Generative Models for Synthetic Data Generation in PDAC Bulk Gene Expression Data

Sultan Sevgi Turgut Ögme, Zeyneb Kurt, Nizamettin Aydin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pancreatic Ductal Adenocarcinoma (PDAC) is among the deadliest cancer types, and early detection is critical to improving survival rates. However, developing effective detection models is challenging because they require high-quality, class-balanced datasets. Generative models have recently gained attention for addressing this issue. In this study, we compare three tabular generative models, Conditional Tabular Generative Adversarial Networks (CTGAN), Tabular Variational Autoencoder (TVAE), and Gaussian Copula (GC), using PDAC gene expression data. We first constructed an integrated dataset by curating six PDAC studies and applied an ensemble-based feature selection approach combining Differential Expression (DEG) analysis, ANOVA, Lasso, and Mutual Information. The synthetic data were evaluated statistically (using the Correlation Discrepancy (CD), Kolmogorov-Smirnov (KS), and Statistical Similarity (SS) metrics), biologically (via PDAC marker genes), and visually in 2D-PCA space. The GC model produced the most realistic synthetic data, with metric values of 0.1482 CD, 0.8120 KS, and 0.9529 SS, expression levels of PDAC markers similar to those in the real data, and samples that were uniformly distributed with the real data. TVAE followed GC. Based on these findings, we proposed an ensemble model combining GC- and TVAE-generated samples. Classification experiments using Random Forest (RF) and Support Vector Machine (SVM) showed that the ensemble generative model did not achieve the highest performance with SVM (0.8541 precision, 0.8570 recall, 0.8533 F1-measure, and 0.9236 AUC), but achieved 0.8549 precision, 0.8623 recall, 0.8568 F1-measure, and 0.9246 AUC with RF, making it a promising model for future applications.
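The KS-based comparison of real and synthetic expression values mentioned in the abstract can be illustrated with a minimal sketch. It assumes the KS score is the complement of the two-sample Kolmogorov-Smirnov statistic averaged over genes, which is one common convention for tabular synthetic data; the abstract does not state the exact definition used. The function names, gene names, and values below are hypothetical and serve only as a toy illustration.

```python
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    # The maximum gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_similarity(real_table, synth_table):
    """Mean of (1 - KS statistic) over columns (assumed convention).

    1.0 means every marginal distribution matches exactly;
    0.0 means every pair of marginals is completely separated.
    """
    scores = [
        1.0 - ks_statistic(real_table[gene], synth_table[gene])
        for gene in real_table
    ]
    return sum(scores) / len(scores)


# Toy data with hypothetical gene names and made-up expression values.
real = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [0.5, 0.7, 0.9, 1.1]}
synth = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [5.0, 6.0, 7.0, 8.0]}

score = ks_similarity(real, synth)
print(score)  # GENE_A matches exactly, GENE_B does not overlap at all
```

Under this convention a higher score means the per-gene marginal distributions of the synthetic data sit closer to those of the real data. The score says nothing about relationships between genes, which is why a correlation-based metric such as CD is reported alongside it.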

Original language: English
Title of host publication: Proceedings of the 7th International Conference on Statistics
Subtitle of host publication: Theory and Application, ICSTA 2025
Editors: Noelle Samia, Dirk Husmeier
Publisher: Avestia Publishing
ISBN (Print): 9781990800597
DOIs
Publication status: Published - 2025
Event: 7th International Conference on Statistics: Theory and Applications, ICSTA 2025 - Paris, France
Duration: 17 Aug 2025 to 19 Aug 2025

Publication series

Name: Proceedings of the International Conference on Statistics
ISSN (Electronic): 2562-7767

Conference

Conference: 7th International Conference on Statistics: Theory and Applications, ICSTA 2025
Country/Territory: France
City: Paris
Period: 17/08/25 to 19/08/25

Bibliographical note

Publisher Copyright:
© 2025, Avestia Publishing. All rights reserved.

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • ensemble
  • gene expression
  • generative models
  • pancreatic cancer
