Reducing storage requirements for biological sequence comparison
Published in | Bioinformatics Vol. 20; no. 18; pp. 3363 - 3369 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Oxford; Oxford University Press; 12.12.2004; Oxford Publishing Limited (England) |
Summary: | Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the ‘seed-and-extend’ approach, in which occurrences of short subsequences called ‘seeds’ are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. Results: We present a simple and elegant method in which only a small fraction of seeds, called ‘minimizers’, needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds. |
---|---|
Bibliography: | istex:AE0F1E788225F42C23572047BD943AEC724B3366 ark:/67375/HXZ-1CZD454H-G local:bth408 Contact: yorke@ipst.umd.edu, bhunt@ipst.umd.edu |
ISSN: | 1367-4803 1460-2059 1367-4811 |
DOI: | 10.1093/bioinformatics/bth408 |
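
The summary above says that only a small fraction of seeds, the minimizers, needs to be stored. The record does not spell out the selection rule, so the following is only a minimal sketch under stated assumptions: seeds are taken to be all length-`k` substrings, and from each window of `w` consecutive seeds the lexicographically smallest one is kept. The function names and the parameters `k` and `w` are illustrative choices, not taken from this record.

```python
def kmers(seq, k):
    """Return every length-k substring (seed) of seq with its start position."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return the set of (seed, position) pairs kept as window minima.

    Assumption of this sketch: each window covers w consecutive seeds and
    the lexicographically smallest seed in the window is kept, with ties
    broken in favour of the leftmost position.
    """
    seeds = kmers(seq, k)
    chosen = set()
    for start in range(len(seeds) - w + 1):
        window = seeds[start:start + w]
        # Tuples compare by seed text first, then by position.
        chosen.add(min(window))
    return chosen


if __name__ == "__main__":
    seq = "ACGTACGTTGCA"
    all_seeds = kmers(seq, 3)
    kept = minimizers(seq, 3, 4)
    print(f"{len(kept)} of {len(all_seeds)} seeds kept")
    for seed, pos in sorted(kept, key=lambda pair: pair[1]):
        print(pos, seed)
```

Under this selection rule, any two sequences that share an identical stretch of `w + k - 1` characters also share at least one kept seed, because that stretch forms one complete window in both. This is why storing fewer seeds need not lose many matches.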