Effective Feature Selection for Classification of Promoter Sequences

Bibliographic Details
Published in PloS one Vol. 11; no. 12; p. e0167165
Main Authors K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish
Format Journal Article
Language English
Published United States Public Library of Science 15.12.2016
Public Library of Science (PLoS)

Summary: Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.
Competing Interests: The authors have declared that no competing interests exist, even though one of the authors (AKK) is affiliated to a private company (Shodhaka Life Sciences Pvt. Ltd) and this may be perceived by some as a potential conflict. But AKK is the founder director of this startup with focus on data analysis services for other scientists. Even though he has used the help of a staff member in extracting some of the data for the current work, the company has no commercial relevance to any part of the current study objectives or manuscript preparation.
Conceptualization: KK LR. Data curation: AKK. Formal analysis: KK LPG LR AKK. Investigation: KK LR AKK. Methodology: KK LR LPG AKK. Project administration: LR. Resources: KK LPG LR AKK. Software: KK LPG. Supervision: LR AKK. Validation: KK LR AKK. Visualization: KK LR AKK. Writing – original draft: KK LR AKK. Writing – review & editing: KK LR AKK.
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0167165