Efficient indexing of peptides for database search using Tide

Bibliographic Details
Published in	bioRxiv
Main Authors	Adoquaye Acquaye, Frank Lawrence Nii; Kertesz-Farkas, Attila; Noble, William Stafford
Format	Paper
Language	English
Published	Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 2022-10-01
Edition	1.1
Subjects	Bioinformatics Digestion Mass spectroscopy Peptides Post-translation Search engines Statistics
ISSN	2692-8205
DOI	10.1101/2022.09.30.510396

Abstract	The first step in the analysis of protein tandem mass spectrometry data typically involves searching the observed spectra against a protein database. During database search, the search engine must digest the proteins in the database into peptides, subject to digestion rules that are under user control. The choice of these digestion parameters, as well as the selection of post-translational modifications (PTMs), can dramatically affect the size of the search space and hence the statistical power of the search. The Tide search engine separates the creation of the peptide index from the database search step, thereby saving time by allowing a peptide index to be reused in multiple searches. Here we describe an improved implementation of the indexing component of Tide that consumes roughly four times fewer resources (CPU and RAM) than the previous version and can generate arbitrarily large peptide databases, limited only by the amount of available disk space. We use this improved implementation to explore the relationship between database size and the parameters controlling digestion and PTMs, as well as between database size and statistical power. Our results can help guide practitioners in the proper selection of these important parameters. Competing Interest Statement: The authors have declared no competing interest. Footnote: http://crux.ms
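The abstract's point that digestion parameters control the size of the search space can be illustrated with a minimal sketch. This is not Tide's implementation: the function name `tryptic_digest`, its parameters, and the simplified trypsin rule (cleave after K or R unless the next residue is P) are assumptions chosen for illustration, and the example sequence is made up.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_len=6, max_len=50):
    """Return the set of peptides produced by a simplified trypsin rule.

    Hypothetical helper for illustration only; parameter names are not
    taken from Tide.
    """
    # Split after K or R, except when the next residue is P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional consecutive pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


if __name__ == "__main__":
    # Made-up sequence; allowing more missed cleavages enlarges the peptide set.
    sequence = "AAAKGGGRCCCKPDDDRLLLK"
    for mc in range(3):
        count = len(tryptic_digest(sequence, missed_cleavages=mc, min_len=1))
        print(f"missed_cleavages={mc}: {count} peptides")
```

Under these assumptions, each extra allowed missed cleavage can only add peptides, which is one reason a persistent, reusable index becomes attractive as the parameter space grows.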
Author	Adoquaye Acquaye, Frank Lawrence Nii; Kertesz-Farkas, Attila; Noble, William Stafford
Author_xml	1. Adoquaye Acquaye, Frank Lawrence Nii (Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University); 2. Kertesz-Farkas, Attila (Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University); 3. Noble, William Stafford (Paul G. Allen School of Computer Science and Engineering, University of Washington; ORCID 0000-0001-7283-4715; email william-noble@uw.edu)
ContentType	Paper
Copyright	2022. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. 2022, Posted by Cold Spring Harbor Laboratory
Discipline	Statistics Biology
EISSN	2692-8205
ExternalDocumentID	2022.09.30.510396v1
Genre	Working Paper/Pre-Print
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
License	This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0
Notes	Competing Interest Statement: The authors have declared no competing interest.
ORCID	0000-0001-7283-4715
OpenAccessLink	https://www.biorxiv.org/content/10.1101/2022.09.30.510396
PageCount	11
URI	https://www.proquest.com/docview/2719933577 https://www.biorxiv.org/content/10.1101/2022.09.30.510396