Indexing and searching petabase-scale nucleotide resources
Published in | Nature methods Vol. 21; no. 6; pp. 994 - 1002 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | New York; Nature Publishing Group US; 01.06.2024; Nature Publishing Group |
Subjects | |
Summary: | Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way of finding relevant subsets of large nucleotide resources, substantially reducing the effort for downstream analysis. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash. The Pebblescout tool achieves an efficient search for subjects in a large nucleotide database such as runs in Sequence Read Archive data. |
---|---|
Bibliography: | Author contribution: S.A.S. proposed the presented solution, designed, and implemented the software. R.A. managed the project, did testing, and found applications. Both authors contributed to building databases, data interpretation, and writing the manuscript. |
ISSN: | 1548-7091 1548-7105 |
DOI: | 10.1038/s41592-024-02280-z |
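
The summary describes an index built by sampling sequences and a search that ranks subjects by how informative their short matches are. The toy sketch below illustrates that general idea only; it is not Pebblescout's implementation. It assumes a window-minimizer sampling scheme and an inverse-frequency score, neither of which is stated in this record, and the function names, the parameters `k` and `w`, and the example sequences are all hypothetical.

```python
from collections import defaultdict
from math import log


def sample_kmers(seq, k=5, w=3):
    """Sample k-mers from seq: the smallest k-mer in each window of w consecutive k-mers.

    Assumed scheme (window minimizers). Two sequences that share a substring
    of length >= k + w - 1 are guaranteed to share at least one sampled k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        # Sequence too short for a full window: fall back to its smallest k-mer.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects, k=5, w=3):
    """Map each sampled k-mer to the set of subject names that contain it."""
    index = defaultdict(set)
    for name, seq in subjects.items():
        for kmer in sample_kmers(seq, k, w):
            index[kmer].add(name)
    return index


def search(index, n_subjects, query, k=5, w=3):
    """Rank subjects by summed match weight; k-mers found in fewer subjects weigh more.

    The log-based weight is an assumed stand-in for "informativeness".
    """
    scores = defaultdict(float)
    for kmer in sample_kmers(query, k, w):
        hits = index.get(kmer, ())
        for name in hits:
            scores[name] += log(1 + n_subjects / len(hits))
    # Highest score first; ties broken by subject name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical subjects standing in for runs or assemblies.
subjects = {
    "run_A": "ACGTACGTTAGCCGATAGCTAGG",
    "run_B": "TTTTTTTTTTTTTTTTTTTT",
}
index = build_index(subjects)
results = search(index, len(subjects), "CGTTAGCCGA")
```

Here `results` is a list of `(subject, score)` pairs in ranked order. Sampling keeps the index smaller than one storing every k-mer, at the cost of only guaranteeing matches above a minimum shared length.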