
Investigating Tabular Generative Models for Synthetic Data Generation in PDAC Bulk Gene Expression Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pancreatic Ductal Adenocarcinoma (PDAC) is among the deadliest cancer types, and early detection is critical to improving survival rates. However, developing effective detection models is challenging because they require high-quality, class-balanced datasets. Generative models have recently gained attention as a way to address this issue. In this study, we compare three tabular generative models, Conditional Tabular Generative Adversarial Network (CTGAN), Tabular Variational Autoencoder (TVAE), and Gaussian Copula (GC), on PDAC gene expression data. We first constructed an integrated dataset by curating six PDAC studies and applied an ensemble-based feature selection approach combining Differential Expression (DEG) analysis, ANOVA, Lasso, and Mutual Information. The synthetic data were evaluated statistically (using the Correlation Discrepancy (CD), Kolmogorov-Smirnov (KS), and Statistical Similarity (SS) metrics), biologically (via PDAC marker genes), and visually in 2D PCA space. The GC model produced the most realistic synthetic data, with metric values of 0.1482 CD, 0.8120 KS, and 0.9529 SS, expression levels of PDAC markers similar to those in the real data, and a distribution in line with that of the real data. TVAE ranked second. Based on these findings, we propose an ensemble model that combines GC- and TVAE-generated samples. Classification experiments using Random Forest (RF) and Support Vector Machine (SVM) showed that the ensemble generative model did not achieve the highest performance with SVM (0.8541 precision, 0.8570 recall, 0.8533 F1-measure, and 0.9236 AUC) but reached 0.8549 precision, 0.8623 recall, 0.8568 F1-measure, and 0.9246 AUC with RF, which makes it a promising model for future applications.
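The abstract names its statistical metrics without defining them, so the following is only a minimal sketch of how such real-versus-synthetic comparisons are often computed, not the authors' implementation. It assumes that the KS-based score is one minus the two-sample KS statistic averaged over genes (so that higher is better) and that Correlation Discrepancy is the mean absolute difference between pairwise gene-gene Pearson correlations (so that lower is better). The gene names and toy data are hypothetical.

```python
import random
from bisect import bisect_right
from math import sqrt


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_ks_similarity(real_cols, synth_cols):
    """Assumed definition: average of (1 - KS statistic) over all genes."""
    scores = [1.0 - ks_statistic(real_cols[g], synth_cols[g]) for g in real_cols]
    return sum(scores) / len(scores)


def correlation_discrepancy(real_cols, synth_cols):
    """Assumed definition: mean absolute difference of pairwise correlations."""
    genes = list(real_cols)
    diffs = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            r = pearson(real_cols[genes[i]], real_cols[genes[j]])
            s = pearson(synth_cols[genes[i]], synth_cols[genes[j]])
            diffs.append(abs(r - s))
    return sum(diffs) / len(diffs)


# Toy data with hypothetical gene names; each column is one gene's expression values.
rng = random.Random(0)
real_base = [rng.gauss(0, 1) for _ in range(200)]
real = {"gene_a": real_base, "gene_b": [v + rng.gauss(0, 0.5) for v in real_base]}
synth_base = [rng.gauss(0, 1) for _ in range(200)]
synth = {"gene_a": synth_base, "gene_b": [v + rng.gauss(0, 0.5) for v in synth_base]}

print("mean KS similarity:", mean_ks_similarity(real, synth))
print("correlation discrepancy:", correlation_discrepancy(real, synth))
```

Under these assumed definitions, a synthetic table that matched the real one exactly would score 1.0 on the KS-based similarity and 0.0 on the correlation discrepancy, which is consistent with the direction of the values reported in the abstract.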

Original language: English
Title of host publication: Proceedings of the 7th International Conference on Statistics
Subtitle of host publication: Theory and Application, ICSTA 2025
Editors: Noelle Samia, Dirk Husmeier
Publisher: Avestia Publishing
ISBN (Print): 9781990800597
DOIs:
Publication status: Published - 2025
Event: 7th International Conference on Statistics: Theory and Applications, ICSTA 2025 - Paris, France
Duration: 17 Aug 2025 to 19 Aug 2025

Publication series

Name: Proceedings of the International Conference on Statistics
ISSN (Electronic): 2562-7767

Conference

Conference: 7th International Conference on Statistics: Theory and Applications, ICSTA 2025
Country/Territory: France
City: Paris
Period: 17/08/25 to 19/08/25

Bibliographical note

Publisher Copyright:
© 2025, Avestia Publishing. All rights reserved.

UN SDGs

This output contributes to the following UN Sustainable Development Goal(s)

  1. SDG 3 - Good Health and Well-being
