Indexing and searching petabase-scale nucleotide resources

Bibliographic Details
Published in: Nature Methods, Vol. 21, no. 6, pp. 994-1002
Main Authors: Shiryev, Sergey A.; Agarwala, Richa
Format: Journal Article
Language: English
Published: New York, Nature Publishing Group US, 01.06.2024
Summary: Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them by the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that, for a wide range of query lengths, Pebblescout provides a data-driven way of finding relevant subsets of large nucleotide resources, substantially reducing the effort required for downstream analysis. We also show that Pebblescout results compare favorably to those of MetaGraph and Sourmash. Pebblescout enables efficient search for subjects in large nucleotide databases, such as runs in the Sequence Read Archive.
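The summary names three ideas: sampling k-mers from each subject, matching a query against the sampled k-mers, and ranking subjects by how informative the matches are. The toy sketch below illustrates that general pattern only. It is not Pebblescout's implementation: the minimizer-style sampling, the parameters `k` and `w`, the inverse-frequency weight, and all function names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def sample_kmers(seq, k=5, w=3):
    """Minimizer-style sampling (an assumed scheme, not the paper's):
    keep the lexicographically smallest k-mer in every window of
    w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects, k=5, w=3):
    """Map each sampled k-mer to the set of subject ids containing it."""
    index = defaultdict(set)
    for subject_id, seq in subjects.items():
        for kmer in sample_kmers(seq, k, w):
            index[kmer].add(subject_id)
    return index


def search(index, query, n_subjects, k=5, w=3):
    """Score subjects by summing a weight per matched k-mer.
    Rarer k-mers get a larger weight (hypothetical informativeness score)."""
    scores = defaultdict(float)
    for kmer in sample_kmers(query, k, w):
        hits = index.get(kmer, ())
        for subject_id in hits:
            scores[subject_id] += math.log(1 + n_subjects / len(hits))
    # Highest score first; ties broken by subject id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Made-up subject ids and sequences, for illustration only.
    subjects = {
        "run_A": "ACGTACGTTGCAAGCT",
        "run_B": "TTTTTTTTTTTTTTTT",
        "run_C": "ACGTACGTTGCAAGCTGGGG",
    }
    index = build_index(subjects)
    for subject_id, score in search(index, "ACGTACGTTGCA", len(subjects)):
        print(subject_id, round(score, 3))
```

With this sampling scheme, a query of at least `k + w - 1` bases that occurs exactly in a subject samples only k-mers that the subject also samples, which is one simple example of the kind of guarantee a sampling-based index can offer. The guarantees Pebblescout actually provides are defined in the paper.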
Author contributions: S.A.S. proposed the presented solution and designed and implemented the software. R.A. managed the project, performed testing, and found applications. Both authors contributed to building databases, data interpretation, and writing the manuscript.
ISSN: 1548-7091, 1548-7105
DOI: 10.1038/s41592-024-02280-z