Influence of Sequence Length in Promoter Prediction Performance
The advent of rapid evolution on sequencing capacity of new genomes has evidenced the need for data analysis automation aiming at speeding up the genomic annotation process and reducing its cost. Given that one important step for functional genomic annotation is the promoter identification, several...
Saved in:
Published in | Advances in Bioinformatics and Computational Biology Vol. 8826; pp. 41 - 48 |
---|---|
Main Authors | , , |
Format | Book Chapter |
Language | English |
Published |
Switzerland
Springer International Publishing AG
2014
Springer International Publishing |
Series | Lecture Notes in Computer Science |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The advent of rapid evolution on sequencing capacity of new genomes has evidenced the need for data analysis automation aiming at speeding up the genomic annotation process and reducing its cost. Given that one important step for functional genomic annotation is the promoter identification, several studies have been taken in order to propose computational approaches to predict promoters. Different classifiers and characteristics of the promoter sequences have been used to deal with this prediction problem. However, several works in literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. As the sequence length defines the amount of dataset attributes, even considering a limited number of properties to characterize the sequences, datasets with a high number of attributes are generated for training classifiers. Once high-dimensional datasets can degrade the classifiers predictive performance or even require an infesible processing time, predicting promoters by training classifiers from datasets with a reduced number of attributes, it is essential to obtain good predictive performance with low computational cost. To the best of our knowledge, there is no work in literature that verified in a sistematic way the relation between the sequences length and the predictive performance of classifiers. Thus, in this work, sixteen datasets composed of different sized sequences are built and evaluated using the SVM and k-NN classifiers. The experimental results show that several datasets composed of shorter sequences acheived better predictive performance when compared with datasets composed of longer sequences and consumed a significantly shorter processing time. |
---|---|
Bibliography: | This research was partially supported by CNPq, FAPEMIG and UFOP. |
ISBN: | 331912417X 9783319124179 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-319-12418-6_6 |