Efficient indexing of peptides for database search using Tide

Bibliographic Details
Published in: bioRxiv
Main Authors: Adoquaye Acquaye, Frank Lawrence Nii; Kertesz-Farkas, Attila; Noble, William Stafford
Format: Paper
Language: English
Published: Cold Spring Harbor, 2022-10-01
Publisher: Cold Spring Harbor Laboratory Press; Cold Spring Harbor Laboratory
Edition: 1.1
ISSN: 2692-8205
DOI: 10.1101/2022.09.30.510396

Abstract The first step in the analysis of protein tandem mass spectrometry data typically involves searching the observed spectra against a protein database. During database search, the search engine must digest the proteins in the database into peptides, subject to digestion rules that are under user control. The choice of these digestion parameters, as well as the selection of post-translational modifications (PTMs), can dramatically affect the size of the search space and hence the statistical power of the search. The Tide search engine separates the creation of the peptide index from the database search step, thereby saving time by allowing a peptide index to be reused in multiple searches. Here we describe an improved implementation of the indexing component of Tide that consumes roughly four times fewer resources (CPU and RAM) than the previous version and can generate arbitrarily large peptide databases, limited only by the amount of available disk space. We use this improved implementation to explore the relationship between database size and the parameters controlling digestion and PTMs, and between database size and statistical power. Our results can help guide practitioners in the proper selection of these important parameters.
Competing Interest Statement The authors have declared no competing interest.
Footnotes * http://crux.ms
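The digestion step described in the abstract can be illustrated with a minimal sketch. This is not Tide's implementation. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), and the function name `tryptic_digest`, its parameters, and the example sequence are hypothetical, chosen only to show how a single digestion parameter (the number of allowed missed cleavages) changes the number of candidate peptides.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_len=6, max_len=50):
    """Return the set of distinct peptides from an in-silico trypsin digest.

    Assumed rule (not taken from the source): cleave after K or R,
    except when the following residue is P.
    """
    # Zero-width split after K/R not followed by P; drop any empty trailing piece.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional consecutive pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


if __name__ == "__main__":
    # Hypothetical toy sequence, used only for illustration.
    toy_protein = "AAAKGGGRCCCKPDDD"
    for mc in (0, 1, 2):
        n = len(tryptic_digest(toy_protein, missed_cleavages=mc, min_len=1))
        print(f"missed_cleavages={mc}: {n} peptides")
```

Even on this toy sequence, allowing more missed cleavages increases the number of distinct peptides, which is the kind of search-space growth the abstract refers to; variable PTMs would multiply the candidates further.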
Author Adoquaye Acquaye, Frank Lawrence Nii
Kertesz-Farkas, Attila
Noble, William Stafford
Author_xml – sequence: 1
  givenname: Frank Lawrence Nii
  surname: Adoquaye Acquaye
  fullname: Adoquaye Acquaye, Frank Lawrence Nii
  organization: Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University
– sequence: 2
  givenname: Attila
  surname: Kertesz-Farkas
  fullname: Kertesz-Farkas, Attila
  organization: Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University
– sequence: 3
  givenname: William Stafford
  orcidid: 0000-0001-7283-4715
  surname: Noble
  fullname: Noble, William Stafford
  email: william-noble@uw.edu
  organization: Paul G. Allen School of Computer Science and Engineering, University of Washington
ContentType Paper
Copyright 2022. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
2022, Posted by Cold Spring Harbor Laboratory
Discipline Statistics
Biology
Genre Working Paper/Pre-Print
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
License This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0
OpenAccessLink https://www.biorxiv.org/content/10.1101/2022.09.30.510396
SubjectTerms Bioinformatics
Digestion
Mass spectroscopy
Peptides
Post-translation
Search engines
Statistics