Predicting defects with latent and semantic features from commit logs in an industrial setting

Beyza Eken, Rifat Atar, Sahra Sertalp, Ayse Tosun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Software defect prediction remains a challenging task in industrial settings, where noisy data and/or a lack of data make it hard to build successful prediction models. In this study, we aim to build a change-level defect prediction model for a software project in an industrial setting. We combine probabilistic models, namely matrix factorization and topic modeling, expecting that hidden features extracted from multiple sources will help overcome the noisy and limited nature of industrial data. Commit-level process metrics, latent features from commits, and semantic features from commit messages are combined to build the defect predictors, using log filtering and feature selection techniques and two machine learning algorithms, Naive Bayes and Extreme Gradient Boosting (XGBoost). Collecting data from various sources and applying data pre-processing techniques yields a statistically significant improvement in probability of detection of up to 24% compared to a base model with process metrics only.
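Two of the terms in the abstract can be illustrated with a minimal sketch. The abstract does not say how log filtering was implemented or how the evaluation was set up, so the code below assumes the form commonly used in defect prediction (replacing each numeric metric with its natural logarithm) and the usual definition of probability of detection as TP / (TP + FN). The function names, the `floor` parameter, and the sample values are hypothetical and are not taken from the paper.

```python
import math


def log_filter(values, floor=1e-6):
    """Replace each metric value with its natural logarithm.

    Assumption: values below `floor` are clamped to `floor` so that
    zero-valued metrics do not raise a math domain error.
    """
    return [math.log(max(v, floor)) for v in values]


def probability_of_detection(actual, predicted):
    """Return pd = TP / (TP + FN), where 1 marks a defect-inducing change."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical commit-level process metric (e.g. lines added per commit).
lines_added = [1, 10, 100, 1000]
filtered = log_filter(lines_added)

# Hypothetical labels: 1 = defect-inducing commit, 0 = clean commit.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
pd_score = probability_of_detection(actual, predicted)
```

Log filtering compresses the long right tail that is typical of size-like metrics, which tends to help learners such as Naive Bayes that assume roughly bell-shaped feature distributions. Probability of detection ignores false alarms, so it is usually reported alongside a false-alarm measure.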

Original language: English
Title of host publication: Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 98-105
Number of pages: 8
ISBN (Electronic): 9781728141367
Publication status: Published - Nov 2019
Event: 34th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2019 - San Diego, United States
Duration: 10 Nov 2019 to 15 Nov 2019

Publication series

Name: Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2019

Conference

Conference: 34th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2019
Country/Territory: United States
City: San Diego
Period: 10/11/19 to 15/11/19

Bibliographical note

Publisher Copyright:
© 2019 IEEE.

Funding

This study is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under the project 5170048.

Funders (funder number):
TUBITAK: 5170048
Türkiye Bilimsel ve Teknolojik Araştırma Kurumu

    Keywords

    • Matrix factorization
    • Software defect prediction
    • Topic modeling
    • XGBoost
