Multiclass unbalanced protein data classification using sequence features

Bibliographic Details
Published in 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1-8
Main Authors Suvarna Vani, K., Sravani, T.D.
Format Conference Proceeding
Language English
Published IEEE 01.05.2014
Summary: Protein fold classification is one of the challenging problems in bioinformatics. This work addresses protein fold classification from sequence features, a multi-class problem with unbalanced classes. A simple and computationally inexpensive feature extraction algorithm is proposed to extract novel features from the primary sequences. The Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, is found not to perform well on this problem. Hence, to boost performance, the SMOTE oversampling technique of Chawla et al. [17] is applied to rebalance the data set, and different classifiers, such as the J48 [15] decision tree classifier, are then used to classify folds from the sequence features. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising and validate the simple methodology of boosting for improving performance on the fold classification problem using features derived from the sequences alone. The approach is to extract features from the protein sequences and pass the extracted feature set to the improved oversampling method, which reduces the imbalance present in that feature set. To handle the multiple classes, boosting algorithms such as AdaBoost and LogitBoost are used, which handle multi-class data sets effectively.
DOI:10.1109/CIBCB.2014.6845517
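
The rebalancing step mentioned in the summary can be illustrated with a minimal sketch of the core idea behind SMOTE: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. This is not the paper's implementation. The function names `smote` and `rebalance`, the default `k=5`, and the choice to oversample every class up to the size of the largest class are assumptions made for illustration, and the paper's feature extraction and classifiers (J48, AdaBoost, LogitBoost) are not reproduced here.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: a bare-bones version of the SMOTE idea, not the
    procedure used in the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def rebalance(X, y, k=5, seed=0):
    """Oversample each class up to the size of the largest class (assumed target)."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < target:
            extra = smote(rows, target - len(rows), k=k, seed=seed)
            X_out.extend(extra)
            y_out.extend([label] * len(extra))
    return X_out, y_out


if __name__ == "__main__":
    # Toy feature vectors: class "a" has four samples, class "b" has two.
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
    y = ["a", "a", "a", "a", "b", "b"]
    X_bal, y_bal = rebalance(X, y)
    print(len(X_bal), y_bal.count("a"), y_bal.count("b"))
```

In this sketch the original samples are kept and the synthetic ones are appended after them, so the rebalanced set can be passed directly to any classifier. A class with a single sample cannot be interpolated and raises an error.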