Multiclass unbalanced protein data classification using sequence features

Bibliographic Details
Published in	2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology pp. 1 - 8
Main Authors	Suvarna Vani, K., Sravani, T.D.
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2014
Subjects	Accuracy; AdaBoost; Amino acids; Boosting; Clustering algorithms; Feature extraction; LogitBoost; Oversampling; Protein fold classification; Proteins; SMOTE; Unbalanced data; Vectors

Summary:	Protein fold classification is one of the challenging problems in bioinformatics. This work addresses protein fold classification from sequence features, a multi-class problem with unbalanced classes. A simple and computationally inexpensive feature extraction algorithm is proposed to extract novel features from the primary sequences. The Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, is found not to perform well on this problem. Hence, to improve performance, the SMOTE oversampling technique of Chawla et al. [17] is applied to rebalance the data set, and classifiers such as the J48 [15] decision tree are then used to classify folds from the sequence features. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results are promising and support this simple methodology for improving fold classification using features derived from the sequences alone: extract features from the protein sequences, then apply the improved oversampling method to reduce the imbalance present in the extracted feature set. To handle the multiple classes, boosting algorithms such as AdaBoost and LogitBoost are used, which handle multi-class data sets effectively.
DOI:	10.1109/CIBCB.2014.6845517
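
The rebalancing step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not specify the paper's features, so amino acid composition is used here as an assumed stand-in, and the function names `composition` and `smote` are hypothetical. The oversampling follows the general SMOTE idea of interpolating between a minority-class sample and one of its nearest same-class neighbours.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard amino acid in a sequence.

    Assumed stand-in feature set; the paper's own features are not given here.
    """
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def smote(X, y, k=3, seed=0):
    """Oversample each smaller class up to the size of the largest class.

    Every synthetic point lies on the segment between a real sample and one
    of its k nearest same-class neighbours (Euclidean distance).
    """
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(pts) for pts in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, pts in by_class.items():
        if len(pts) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - len(pts)):
            base = rng.choice(pts)
            neighbours = sorted(
                (p for p in pts if p is not base),
                key=lambda p: math.dist(base, p),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy example: three sequences in fold "a", two in fold "b" (made-up data).
seqs = ["ACDKLM", "AACKLL", "ACCKLM", "GGHWWY", "GHHWYY"]
folds = ["a", "a", "a", "b", "b"]
X = [composition(s) for s in seqs]
X_bal, y_bal = smote(X, folds)
print(Counter(y_bal))
```

The balanced set `X_bal`, `y_bal` would then be passed to a classifier such as a decision tree or a boosted ensemble. Classes with only one sample are left as they are in this sketch, since interpolation needs at least two points.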