Splice site identification in human genome using random forest

Elham Pashaei, Mustafa Ozen, Nizamettin Aydin*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

17 Citations (Scopus)

Abstract

Gene identification has become an increasingly important task owing to developments arising from the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are used to detect splice sites, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, the performance of RF was investigated for both feature selection and classification in the splice site domain. We propose three methods, namely MM1-RF, MM2-RF and MCM-RF, which combine RF with the first-order Markov model (MM1), the second-order Markov model (MM2) and the Markov chain model (MCM), respectively. We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also evaluated the proposed methods against other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier at detecting splice sites.
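To make the idea of a first-order Markov encoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common formulation in which each position of a fixed-length sequence window is replaced by a position-specific transition probability estimated from known true sites. The function names `fit_mm1` and `encode_mm1`, the `pseudocount` parameter and the toy sequences are all hypothetical, and the encoding used in the paper may differ in detail.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order transition probabilities.

    Returns a list `probs` in which probs[i][(prev, cur)] approximates
    P(s[i+1] = cur | s[i] = prev) for i = 0 .. L-2, where L is the
    common sequence length. Additive smoothing avoids zero estimates.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    probs = []
    for i in range(length - 1):
        # Count each (previous base, current base) pair at this position.
        counts = defaultdict(float)
        for s in sequences:
            counts[(s[i], s[i + 1])] += 1.0

        # Normalise per previous base so each row sums to 1.
        table = {}
        for prev in ALPHABET:
            total = sum(counts[(prev, c)] for c in ALPHABET)
            total += pseudocount * len(ALPHABET)
            for cur in ALPHABET:
                table[(prev, cur)] = (counts[(prev, cur)] + pseudocount) / total
        probs.append(table)
    return probs


def encode_mm1(seq, probs):
    """Encode one sequence as a vector of L-1 transition probabilities."""
    return [probs[i][(seq[i], seq[i + 1])] for i in range(len(seq) - 1)]


# Toy usage with made-up sequences (not taken from HS3D or NN269).
true_sites = ["AGGT", "AGGT", "CGGT", "AGGA"]
model = fit_mm1(true_sites)
features = encode_mm1("AGGT", model)
```

Vectors like `features` could then be given to a random forest implementation, for example scikit-learn's `RandomForestClassifier`, for both classification and feature ranking.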

Original language: English
Pages (from-to): 141-152
Number of pages: 12
Journal: Health and Technology
Volume: 7
Issue number: 1
Publication status: Published - 1 Mar 2017
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2016, IUPESM and Springer-Verlag Berlin Heidelberg.

Keywords

  • DNA encoding methods
  • Gene detection
  • Random Forest classifier
  • Splice site prediction
