QT-GILD: Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data
Published in | Research in Computational Molecular Biology Vol. 13278; pp. 159 - 176 |
---|---|
Main Authors | , , , , , |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer International Publishing AG, 2022 |
Series | Lecture Notes in Computer Science |
Summary: | Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We introduce, for the first time, the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present QT-GILD, an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique that needs no explicit modeling of the subject system, of the reasons for missing data, or of gene tree heterogeneity. Experimental studies on a collection of simulated and empirical data sets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state of the art in species tree estimation from gene trees in the face of missing data. QT-GILD is freely available in open source form at https://github.com/pythonLoader/QT-GILD. |
---|---|
Bibliography: | S. Mahbub and S. Sawmya contributed equally to this work. The preprint version of the paper (available at https://doi.org/10.1101/2021.11.03.467204) contains the supplementary material. |
ISBN: | 3031047486, 9783031047480 |
ISSN: | 0302-9743, 1611-3349 |
DOI: | 10.1007/978-3-031-04749-7_10 |
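
The summary refers to the quartet distribution induced by a set of incomplete gene trees. As a rough illustration of that input only (not of QT-GILD itself, which the summary describes as a deep learning technique), the sketch below counts the quartet topologies induced by a few toy gene trees and lists the four-taxon sets that an incomplete tree cannot contribute. The nested-tuple tree encoding, the function names, and the example trees are all assumptions made for this illustration; none of them come from the paper or its repository.

```python
from collections import Counter
from itertools import combinations


def leaf_sets(tree):
    """Return (leaf set of `tree`, leaf sets of all its internal nodes).

    A tree is a nested tuple of taxon names (hypothetical encoding).
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def quartet(a, b, c, d):
    """Encode the unrooted quartet topology ab|cd."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])


def induced_quartets(tree):
    """Quartet topologies induced by one gene tree on its own taxa."""
    leaves, clades = leaf_sets(tree)
    found = set()
    for four in combinations(sorted(leaves), 4):
        four = frozenset(four)
        for clade in clades:
            side = clade & four
            if len(side) == 2:  # this clade separates two taxa from the other two
                found.add(frozenset([side, four - side]))
                break
    return found


def quartet_distribution(trees):
    """Count how many gene trees induce each quartet topology."""
    counts = Counter()
    for tree in trees:
        counts.update(induced_quartets(tree))
    return counts


def missing_four_taxon_sets(tree, taxa):
    """Four-taxon sets over `taxa` that `tree` cannot resolve (a taxon is absent)."""
    leaves, _ = leaf_sets(tree)
    return [four for four in combinations(sorted(taxa), 4)
            if not set(four) <= leaves]


# Toy data: one complete gene tree and two that lack taxon "E".
taxa = {"A", "B", "C", "D", "E"}
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
dist = quartet_distribution(gene_trees)
for topology, count in dist.items():
    print(sorted(sorted(pair) for pair in topology), count)
print(missing_four_taxon_sets(gene_trees[1], taxa))
```

In this toy input every four-taxon set containing E is supported by a single gene tree, which is the kind of sparsity that, per the summary, imputation of the quartet distribution is meant to address.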