Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach

Bibliographic Details
Published in	Machine Learning and Data Mining in Pattern Recognition, pp. 334-345
Main Authors	Kurgan, Lukasz; Homaeian, Leila
Format	Book Chapter
Language	English Japanese
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2005
Series	Lecture Notes in Computer Science
Subjects	Composition Vector; Feature Selection; Feature Subset; Multiple Linear Regression; Multiple Linear Regression Model
Summary:	Research in protein structure and function is one of the most important subjects in modern bioinformatics and computational biology. It often uses advanced data mining and machine learning methodologies to perform prediction or pattern recognition tasks. This paper describes a new method for predicting protein secondary structure content based on feature selection and multiple linear regression. The method develops a novel representation of primary protein sequences based on a large set of 495 features. The feature selection task, performed on a very large set of nearly 6,000 proteins, and tests performed on standard non-homologous protein sets confirm the high quality of the developed solution. The application of feature selection and the novel representation resulted in a 14-15% error rate reduction compared with results achieved using the standard representation. The prediction tests also show that a small set of 5-25 features is sufficient to achieve accurate prediction of both helix and strand content for non-homologous proteins.
ISBN:	9783540269236; 3540269231
ISSN:	0302-9743; 1611-3349
DOI:	10.1007/11510888_33
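
The pipeline named in the summary (a sequence representation, a selected feature subset, and a multiple linear regression model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses only the standard 20-value composition vector rather than the 495-feature representation, which this record does not describe. The sequences, the chosen feature subset ("A" and "L"), and the target values are all made up, and the target is a synthetic rule rather than measured helix or strand content. The function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(sequence):
    """Return the fraction of each of the 20 amino acids in a sequence."""
    sequence = sequence.upper()
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts


def fit_linear_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... ; returns [b0, b1, ...]."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to a feature matrix."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ coef


# Made-up sequences, for illustration only.
toy_sequences = ["ALKELLAAGL", "GSPTVVYGNC", "MKALLEAALK",
                 "TVYVGSPNGT", "AELLKKAGVV"]
X = np.array([composition_vector(s) for s in toy_sequences])

# Hypothetical "selected" feature subset: the A and L fractions.
selected = [AMINO_ACIDS.index(aa) for aa in "AL"]

# Synthetic target (NOT real structure content): sum of the A and L fractions.
y = X[:, selected].sum(axis=1)

coef = fit_linear_regression(X[:, selected], y)
pred = predict(coef, X[:, selected])
print(coef)
```

Because the synthetic target is an exact linear function of the two selected features, the fit recovers it; with real content values the regression would only approximate the target, and the feature subset would have to come from an actual feature selection step.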