On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics
Published in | bioRxiv |
---|---|
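Main Authors | Machado, Karla Ct; tuin, Suereta; Tomazella, Gisele G; Fonseca, Andre F; Warren, Robin; Wiker, Harald G; De Souza, Sandro J; De Souza, Gustavo A |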
Format | Paper |
Language | English |
Published | Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 28.02.2019 |
Abstract | In proteomics, peptide information within mass spectrometry data from a specific organism sample is routinely challenged against a protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or poorly genetically characterized, it becomes challenging to determine a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and interesting genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To test this assumption, we constructed databases containing conserved and unique sequences for ten different species. Features that are relevant for probabilistic-based protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity or no pangenome at all. This suggests that discrepancies in gene annotation between strains are higher than average for those species. We further tested database performance using mass spectrometry data from eight clinical strains of Mycobacterium tuberculosis and from two published datasets from Staphylococcus aureus. We show that by using an approach where database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases. Footnotes * An additional proteomic dataset (S. aureus) was analysed to verify the performance of the approach. Additional computational performance was measured to demonstrate the advantage of using homology-reduced databases. |
---|---|
Author | De Souza, Gustavo A; Fonseca, Andre F; De Souza, Sandro J; Machado, Karla Ct; Warren, Robin; Wiker, Harald G; tuin, Suereta; Tomazella, Gisele G |
Copyright | 2019. Notwithstanding the ProQuest Terms and conditions, you may use this content in accordance with the associated terms available at https://www.biorxiv.org/content/10.1101/378117v2 |
DOI | 10.1101/378117 |
Genre | Working Paper/Pre-Print |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
SubjectTerms | Amino acid sequence; Computer applications; Homology; Mass spectroscopy; Mycobacterium tuberculosis; Proteomics; Species; Strains (organisms); Tuberculosis |
URI | https://www.proquest.com/docview/2076575709 |
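
The abstract describes controlling database size by removing repeated identical tryptic sequences across strains. A minimal sketch of that idea is given here for illustration only; it is not the authors' pipeline. It assumes a simple trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a minimum peptide length of 6. The function names `tryptic_peptides` and `build_reduced_database`, the strain labels, and the toy sequences are all hypothetical.

```python
def tryptic_peptides(protein, min_len=6):
    """In silico digest: cleave after K or R unless the next residue is P.

    Assumption: no missed cleavages; peptides shorter than min_len are dropped.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal remainder that does not end in K or R.
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def build_reduced_database(strains, min_len=6):
    """Store each identical tryptic peptide once, with the strains that carry it."""
    peptide_to_strains = {}
    for strain, proteins in strains.items():
        for protein in proteins:
            for peptide in tryptic_peptides(protein, min_len):
                peptide_to_strains.setdefault(peptide, set()).add(strain)
    return {pep: sorted(names) for pep, names in peptide_to_strains.items()}


# Toy input (hypothetical): two strains whose proteins differ by one residue.
strains = {
    "strain_A": ["MAGTLLKVDEPSARQWERTYK"],
    "strain_B": ["MAGTLLKVDEPSGRQWERTYK"],
}

reduced = build_reduced_database(strains)
redundant = sum(len(tryptic_peptides(p)) for ps in strains.values() for p in ps)
print(f"{redundant} peptides before reduction, {len(reduced)} after")
```

In this toy case the peptide shared by both strains is stored once while the strain-specific variants are both retained, so the search space shrinks without losing strain-level information.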