RETRACTED ARTICLE: A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

Bibliographic Details
Published in	Journal of applied genetics Vol. 61; no. 2; pp. 231 - 238
Main Authors	Dehghanzadeh, Houshang, Ghaderi-Zefrehei, Mostafa, Mirhoseini, Seyed Ziaeddin, Esmaeilkhaniyan, Saeid, Haruna, Ishaku Lemu, Amirpour Najafabadi, Hamed
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg; Springer; Springer Nature B.V, 01.05.2020
Subjects	Algorithms; Analysis; Animal Genetics and Genomics; Animal Genetics • Original Paper; Annotations; Biomedical and Life Sciences; Biomedical engineering; Centroids; Clustering; Dairy cattle; Deoxyribonucleic acid; Divergence; DNA; DNA sequencing; Entropy; Entropy (Information theory); Exons; Gene clustering; Genes; Human Genetics; Information theory; Iran; Kullback-Leibler divergence; Life Sciences; Machine learning; Mathematical analysis; Metabolic pathways; Microbial Genetics and Genomics; Nucleotide sequence; Nucleotide sequencing; Plant Genetics and Genomics; Sequences; Software engineering
Summary:	Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its set of exons was calculated for orders one to four. The Kullback-Leibler divergence was then calculated from the relative entropy of the genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the output of the AdaBoost algorithm was examined with the GeneMANIA prediction server to assess the results from a gene annotation point of view. All calculations were performed using the MATLAB Engineering Software (2015). An examination of the genes' metabolic pathways based on the gene annotations indicated that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method does not have the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require high memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
ISSN:	1234-1983 2190-3883
DOI:	10.1007/s13353-020-00543-x
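
The entropy and divergence steps named in the summary can be illustrated with a minimal Python sketch. This is not the authors' MATLAB implementation, which the record does not reproduce. It assumes that "order k" means the distribution of overlapping words of length k over the alphabet A, C, G, T, that a pseudocount keeps every probability non-zero, and that the divergence is symmetrized and summed over orders one to four to obtain a distance. The function names and the pseudocount value are illustrative choices, not taken from the article.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed probability of every length-k word over ACGT in seq.

    The pseudocount is an assumption of this sketch; it avoids zero
    probabilities, which would make the divergence undefined.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def entropy(dist):
    """Shannon entropy in bits of a word distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p)


def symmetric_kl(seq_a, seq_b, orders=(1, 2, 3, 4)):
    """Symmetrized divergence summed over the given orders (assumed here)."""
    total = 0.0
    for k in orders:
        p = kmer_distribution(seq_a, k)
        q = kmer_distribution(seq_b, k)
        total += kl_divergence(p, q) + kl_divergence(q, p)
    return total


def distance_matrix(seqs):
    """Pairwise symmetric-KL distances between unaligned sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = symmetric_kl(seqs[i], seqs[j])
    return dist
```

A matrix built this way needs no alignment and works for sequences of different lengths, so it could be passed to any hierarchical or K-means style clustering routine; the sequences used to build it would be the genes or exon sets under study.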