Influence of Sequence Length in Promoter Prediction Performance


Bibliographic Details
Published in	Advances in Bioinformatics and Computational Biology Vol. 8826; pp. 41 - 48
Main Authors	Carvalho, Sávio G., Guerra-Sá, Renata, de C. Merschmann, Luiz H.
Format	Book Chapter
Language	English
Published	Switzerland Springer International Publishing AG 2014 Springer International Publishing
Series	Lecture Notes in Computer Science
Subjects	Algorithms & data structures; Data mining; Life sciences: general issues; Nucleosome Position; Predictive Performance; Promoter Sequence; Sequence Length; Short Processing Time

Summary:	Rapid growth in genome sequencing capacity has made clear the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to predict promoters, using different classifiers and different characteristics of the promoter sequences. However, several works in the literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. Because the sequence length determines the number of dataset attributes, datasets with a high number of attributes are generated for training classifiers even when only a limited number of properties is used to characterize the sequences. Since high-dimensional datasets can degrade a classifier's predictive performance or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential for obtaining good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically examined the relation between sequence length and the predictive performance of classifiers. Thus, in this work, sixteen datasets composed of sequences of different lengths are built and evaluated using the SVM and k-NN classifiers. The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences and required significantly less processing time.
Bibliography:	This research was partially supported by CNPq, FAPEMIG and UFOP.
ISBN:	331912417X 9783319124179
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-319-12418-6_6
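
The summary's point that sequence length determines the number of dataset attributes can be illustrated with a minimal sketch. It is not the chapter's method: the one-hot encoding (four binary attributes per nucleotide), the centered trimming window, and the function names `trim_centered`, `one_hot`, and `attribute_count` are all assumptions introduced here for illustration, and the example sequence is synthetic.

```python
# Hypothetical illustration only: the chapter does not state this encoding.
# Assumption: each nucleotide position contributes a fixed number of attributes,
# so the attribute count grows linearly with sequence length.

NUCLEOTIDES = "ACGT"


def trim_centered(sequence: str, length: int) -> str:
    """Return the centered window of `length` nucleotides (assumed trimming rule)."""
    if length > len(sequence):
        raise ValueError("window is longer than the sequence")
    start = (len(sequence) - length) // 2
    return sequence[start:start + length]


def one_hot(sequence: str) -> list[int]:
    """Encode each nucleotide as four binary attributes (A, C, G, T).

    Characters outside ACGT are encoded as four zeros.
    """
    features: list[int] = []
    for base in sequence:
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def attribute_count(length: int, properties_per_position: int = 4) -> int:
    """Number of attributes for a sequence of `length` positions."""
    return length * properties_per_position


if __name__ == "__main__":
    synthetic = "ACGT" * 75  # synthetic 300-nucleotide sequence, not real data
    for length in (50, 100, 250):
        window = trim_centered(synthetic, length)
        print(length, "nucleotides ->", len(one_hot(window)), "attributes")
```

Under these assumptions, a 250-nucleotide sequence yields 1000 attributes while a 50-nucleotide window yields 200, which is the kind of dimensionality reduction the summary associates with shorter processing time.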