On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics

Published in bioRxiv
Main Authors Machado, Karla Ct, tuin, Suereta, Tomazella, Gisele G, Fonseca, Andre F, Warren, Robin, Wiker, Harald G, De Souza, Sandro J, De Souza, Gustavo A
Format Paper
Language English
Published Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 28.02.2019
Abstract In proteomics, peptide information within mass spectrometry data from a specific organism sample is routinely searched against a protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or poorly genetically characterized, it becomes challenging to choose a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for ten different species. Features that are relevant for probabilistic-based protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity or no pangenome at all. This suggests that discrepancies in gene annotation between strains of those species are greater than average. We further tested database performance using mass spectrometry data from eight clinical strains of Mycobacterium tuberculosis and from two published datasets from Staphylococcus aureus. We show that, by using an approach where database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases. Footnotes * An additional proteomic dataset (S. aureus) was analysed in order to verify the performance of the approach. Additional computational performance was measured to demonstrate the advantage of using homology-reduced databases.
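The idea of controlling database size by removing repeated identical tryptic sequences across strains can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the digestion rules, so the code assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a hypothetical minimum peptide length of 6. The function names and the toy sequences are invented for illustration.

```python
def tryptic_peptides(protein, min_len=6):
    """In silico digest under an assumed, simplified trypsin rule:
    cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal fragment that does not end in K or R.
        peptides.append(protein[start:])
    # min_len is a hypothetical cutoff, not a value taken from the paper.
    return [p for p in peptides if len(p) >= min_len]


def nonredundant_peptides(strains):
    """Merge strains, keeping each identical tryptic sequence only once.
    `strains` maps a strain name to a list of protein sequences."""
    seen, unique = set(), []
    for proteins in strains.values():
        for protein in proteins:
            for pep in tryptic_peptides(protein):
                if pep not in seen:
                    seen.add(pep)
                    unique.append(pep)
    return unique


# Toy example: two strains whose proteins differ only near the C-terminus.
toy = {
    "strain_A": ["MKTAYIAKQRQISFVK"],
    "strain_B": ["MKTAYIAKQRGGGGGGK"],
}
merged = nonredundant_peptides(toy)
```

In the toy input the shared peptide is stored once, so the merged list grows only with the sequences that actually differ between strains.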
ContentType Paper
Copyright 2019. Notwithstanding the ProQuest Terms and conditions, you may use this content in accordance with the associated terms available at https://www.biorxiv.org/content/10.1101/378117v2
DOI 10.1101/378117
Genre Working Paper/Pre-Print
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
SubjectTerms Amino acid sequence
Computer applications
Homology
Mass spectroscopy
Mycobacterium tuberculosis
Proteomics
Species
Strains (organisms)
Tuberculosis