Multiclass unbalanced protein data classification using sequence features
Published in | 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology pp. 1 - 8 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.05.2014 |
Summary: | Protein fold classification is one of the challenging problems in bioinformatics. This work addresses protein fold classification using sequence features, a multi-class problem with unbalanced classes. A simple and computationally inexpensive feature extraction algorithm is proposed to extract novel features from the primary sequences. It is found that the Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, does not perform well on this problem. Hence, to boost performance, the SMOTE oversampling technique of Chawla et al. [17] is applied to rebalance the data set, and different classifiers, such as the J48 [15] decision tree classifier, are then used to classify folds from the sequence features. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising, validating the simple methodology of boosting to obtain improved performance on the fold classification problem using features derived from the sequences alone. The approach is to extract features from the protein sequences and pass the extracted feature set to the improved oversampling method, which reduces the imbalance present in that feature set. To handle the multiple classes, different boosting algorithms such as AdaBoost and LogitBoost are used, which handle multi-class data sets effectively. |
---|---|
DOI: | 10.1109/CIBCB.2014.6845517 |
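
The rebalancing step named in the summary can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest neighbours from the same class. This is not the authors' implementation. The function names `smote` and `rebalance`, the toy 2-D feature vectors, and the fold labels `a`, `b`, `c` are all hypothetical, and the record does not describe the paper's feature extraction algorithm, so it is not reproduced here.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a sample
    and one of its k nearest neighbours within the same class."""
    if n_new <= 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest same-class neighbours of base, excluding base itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position along the segment base -> nn
        synthetic.append([b + u * (n - b) for b, n in zip(base, nn)])
    return synthetic


def rebalance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        extra = smote(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Hypothetical toy data: 2-D feature vectors for three unbalanced classes.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.3],
     [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
     [2.0, 0.0], [2.1, 0.2]]
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

X_bal, y_bal = rebalance(X, y)
print(Counter(y), Counter(y_bal))
```

In a full pipeline the rebalanced feature set would then be handed to a classifier such as a decision tree or a boosted ensemble; that stage is omitted here because the record gives no parameters for it.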