Reducing storage requirements for biological sequence comparison

Bibliographic Details
Published in	Bioinformatics Vol. 20; no. 18; pp. 3363 - 3369
Main Authors	Roberts, Michael; Hayes, Wayne; Hunt, Brian R.; Mount, Stephen M.; Yorke, James A.
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press, 12.12.2004; Oxford Publishing Limited (England)
Subjects	Algorithms; Biological and medical sciences; Databases, Genetic; Fundamental and applied biological sciences. Psychology; General aspects; Information Storage and Retrieval - methods; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Numerical Analysis, Computer-Assisted; Sequence Alignment - methods; Sequence Analysis - methods; Bioinformatics; Storage
Summary:	Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the ‘seed-and-extend’ approach, in which occurrences of short subsequences called ‘seeds’ are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. Results: We present a simple and elegant method in which only a small fraction of seeds, called ‘minimizers’, needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
Bibliography:	istex:AE0F1E788225F42C23572047BD943AEC724B3366; ark:/67375/HXZ-1CZD454H-G; local:bth408; Contact: yorke@ipst.umd.edu, bhunt@ipst.umd.edu
ISSN:	1367-4803, 1460-2059, 1367-4811
DOI:	10.1093/bioinformatics/bth408
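
Illustration:	The summary says that only a small fraction of seeds, the minimizers, needs to be stored, but this record does not spell out how they are chosen. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the common window-minimum scheme, in which every run of w consecutive k-letter seeds contributes its lexicographically smallest seed. The function name `minimizers` and the parameters `k` and `w` are hypothetical names introduced here.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Assumption (not stated in the record): a minimizer is the
    lexicographically smallest k-mer among w consecutive k-mers,
    with ties broken by the leftmost position.
    """
    # Every k-letter substring of seq is a candidate seed.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    # Slide a window covering w consecutive seeds across the sequence.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)


if __name__ == "__main__":
    # 4 of the 6 seeds in this toy sequence are kept.
    print(minimizers("ACGTACGT", k=3, w=2))

    # Two strings that share a substring of length w + k - 1 = 7 ("GACGTAC")
    # select the same minimizer inside it, so the match is still detectable.
    a = {kmer for _, kmer in minimizers("TTGACGTACCA", k=3, w=5)}
    b = {kmer for _, kmer in minimizers("GGGACGTACTT", k=3, w=5)}
    print(a & b)
```

Under this assumed scheme, larger values of w keep fewer seeds, at the cost of only guaranteeing detection of shared substrings that are at least w + k - 1 letters long.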