Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash

Bibliographic Details
Published in	Genome Research, Vol. 33, no. 7, pp. 1061-1068
Main Authors	Rahman Hera, Mahmudur; Pierce-Ward, N. Tessa; Koslicki, David
Format	Journal Article
Language	English
Published	United States: Cold Spring Harbor Laboratory Press, 01.07.2023
Subjects	Biological Evolution; Confidence Intervals; Estimates; Genomes; Metagenome; Metagenomics; Metagenomics - methods; Methods; Mutation; Mutation Rate; Mutation rates; Statistical analysis
ISSN	1088-9051, 1549-5469
DOI	10.1101/gr.277651.123

Summary:	Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, so that users of FracMinHash can be warned effectively about the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and that the point estimates and confidence intervals perform significantly better in estimating mutation distances.
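
The ideas in the summary can be illustrated with a minimal Python sketch. It is not taken from the paper or from sourmash, and every name in it (`kmers`, `hash_kmer`, `frac_sketch`, `containment`, `mutation_rate`) is hypothetical. It rests on several assumptions not stated in the summary: a 64-bit hash space, BLAKE2b as the hash function, a sketch that keeps every k-mer whose hash falls below a fixed fraction `scale` of the hash space, a bias correction that divides by `1 - (1 - scale)**n` where `n` is the number of distinct k-mers in the query, and a simple mutation model in which each base mutates independently with probability `p`, so that containment is roughly `(1 - p)**k`. Confidence intervals are not covered here.

```python
import hashlib

MAX_HASH = 2**64  # assumed size of the hash space (64-bit hashes)


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to an integer in [0, MAX_HASH). BLAKE2b is an assumption."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(kmer_set, scale):
    """Keep the hashes that fall below scale * MAX_HASH (0 < scale <= 1)."""
    threshold = MAX_HASH * scale
    return {h for h in map(hash_kmer, kmer_set) if h < threshold}


def containment(sketch_a, sketch_b, n_kmers_a, scale):
    """Estimate the containment of A in B from two sketches.

    The raw ratio is divided by 1 - (1 - scale)**n_kmers_a, the assumed
    bias-correction factor; it is close to 1 for large k-mer sets.
    """
    if not sketch_a:
        raise ValueError("sketch of A is empty; containment is undefined")
    raw = len(sketch_a & sketch_b) / len(sketch_a)
    bias = 1.0 - (1.0 - scale) ** n_kmers_a
    return min(1.0, raw / bias)


def mutation_rate(c, k):
    """Point estimate of p under the assumed model c = (1 - p)**k."""
    return 1.0 - c ** (1.0 / k)
```

With `scale=1.0` the sketch keeps every hash and the estimate reduces to the exact containment, which makes the functions easy to sanity-check on short sequences before trying smaller scale factors.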