Comparison of gene set scoring methods for reproducible evaluation of multiple tuberculosis gene signatures

Bibliographic Details
Published in bioRxiv
Main Authors Wang, Xutao, VanValkenberg, Arthur, Odom-Mabey, Aubrey R, Ellner, Jerrold J, Hochberg, Natasha S, Salgame, Padmini, Patil, Prasad, Johnson, W Evan
Format Journal Article Paper
Language English
Published United States Cold Spring Harbor Laboratory Press 30.01.2023
Abstract Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of progression from infection to disease, and monitor TB treatment outcomes. However, an unresolved issue is whether gene set enrichment analysis (GSEA) of the signature transcripts alone is sufficient for prediction and differentiation, or whether it is necessary to use the original statistical model created when the signature was derived. Intra-method comparison is complicated by the unavailability of original training data, missing details about the original trained model, and inadequate publicly available software tools or source code implementing the models. To facilitate these signatures' replicability and appropriate utilization in TB research, comprehensive comparisons between gene set scoring methods with cross-data validation of original model implementations are needed. We compared the performance of 19 TB gene signatures across 24 transcriptomic datasets using both rebuilt original models and gene set scoring methods to evaluate whether gene set scoring is a reasonable proxy for the performance of the original trained model. We have provided an open-access software implementation of the original models for all 19 signatures for future use. We considered existing gene set scoring and machine learning methods, including ssGSEA, GSVA, PLAGE, Singscore, and Zscore, as alternative approaches to profile gene signature performance. The sample-size-weighted mean area under the curve (AUC) value was computed to measure each signature's performance across datasets. Correlation analysis and Wilcoxon paired tests were used to compare the performance of the enrichment methods with that of the original models. For many signatures, the predictions from gene set scoring methods were highly correlated with, and statistically equivalent to, the results given by the original diagnostic models. PLAGE outperformed all other gene set scoring methods. In some cases, PLAGE outperformed the original models when considering signatures' weighted mean AUC values and the AUC results within individual studies. Gene set enrichment scoring of existing blood-based biomarker gene sets can distinguish patients with active TB disease from latent TB infection and other clinical conditions with equivalent or improved accuracy compared to the original methods and models. These data justify applying gene set scoring methods to published TB gene signatures for predicting TB risk and treatment outcomes, especially when original models are difficult to apply or implement.
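The evaluation described in the abstract (score each sample with a gene set method, compute an AUC per dataset, then average the AUCs weighted by sample size) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based input layout, and the use of a combined Z-score with the sample standard deviation are assumptions made for illustration only, and the abstract does not give the exact formulas used.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_scores(expr, signature):
    """Combined Z-score per sample (illustrative assumption, not the paper's code).

    expr: dict mapping gene name -> list of expression values, one per sample.
    signature: list of gene names in the signature.
    Each signature gene is standardized across samples; the standardized
    values are summed per sample and divided by sqrt(number of genes used).
    """
    n_samples = len(next(iter(expr.values())))
    totals = [0.0] * n_samples
    used = 0
    for gene in signature:
        values = expr.get(gene)
        if values is None:
            continue  # gene not measured in this dataset
        sd = stdev(values)
        if sd == 0:
            continue  # a constant gene carries no information
        mu = mean(values)
        for i, v in enumerate(values):
            totals[i] += (v - mu) / sd
        used += 1
    if used == 0:
        return totals
    return [t / sqrt(used) for t in totals]


def auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    labels: 1 for the positive class (e.g. active TB), 0 for the comparison class.
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_mean_auc(aucs, sizes):
    """Sample-size-weighted mean of per-dataset AUC values."""
    return sum(a * n for a, n in zip(aucs, sizes)) / sum(sizes)
```

For example, under these assumptions weighted_mean_auc([1.0, 0.5], [30, 10]) gives 0.875, because the 30-sample dataset contributes three times the weight of the 10-sample dataset.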
Competing Interest Statement The authors have declared no competing interest.
Footnotes * author updated
Author_xml – sequence: 1
  givenname: Xutao
  surname: Wang
  fullname: Wang, Xutao
  organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA
– sequence: 2
  givenname: Arthur
  surname: VanValkenberg
  fullname: VanValkenberg, Arthur
  organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA
– sequence: 3
  givenname: Aubrey R
  surname: Odom-Mabey
  fullname: Odom-Mabey, Aubrey R
  organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA
– sequence: 4
  givenname: Jerrold J
  surname: Ellner
  fullname: Ellner, Jerrold J
  organization: Department of Medicine, Center for Emerging Pathogens, Rutgers New Jersey Medical School, Newark, NJ, USA
– sequence: 5
  givenname: Natasha S
  surname: Hochberg
  fullname: Hochberg, Natasha S
  organization: Section of Infectious Diseases, Boston University School of Medicine, Boston, MA, USA
– sequence: 6
  givenname: Padmini
  surname: Salgame
  fullname: Salgame, Padmini
  organization: Department of Medicine, Center for Emerging Pathogens, Rutgers New Jersey Medical School, Newark, NJ, USA
– sequence: 7
  givenname: Prasad
  surname: Patil
  fullname: Patil, Prasad
  organization: Department of Biostatistics, Boston University, Boston, MA, USA
– sequence: 8
  givenname: W Evan
  surname: Johnson
  fullname: Johnson, W Evan
  organization: Division of Infectious Disease, Center for Data Science, Rutgers New Jersey Medical School, Newark, NJ, USA
BackLink https://www.ncbi.nlm.nih.gov/pubmed/36711818 (View this record in MEDLINE/PubMed)
CitedBy_id crossref_primary_10_3389_fimmu_2023_1210372
ContentType Journal Article
Paper
Copyright 2023. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1101/2023.01.19.520627
Discipline Biology
EISSN 2692-8205
ExternalDocumentID 36711818
Genre Preprint
Working Paper/Pre-Print
GrantInformation_xml – fundername: NIGMS NIH HHS
  grantid: R01 GM127430
– fundername: NIAID NIH HHS
  grantid: R21 AI154387
ISSN 2692-8205
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Keywords gene signatures
Tuberculosis
gene set scoring methods
reproducibility
PMID 36711818
References 38902649 - BMC Infect Dis. 2024 Jun 20;24(1):610. doi: 10.1186/s12879-024-09457-z
SubjectTerms Clinical outcomes
Correlation analysis
Gene set enrichment analysis
Mathematical models
Software
Statistical analysis
Transcriptomics
Tuberculosis
URI https://www.ncbi.nlm.nih.gov/pubmed/36711818
https://www.proquest.com/docview/2767352388
https://www.proquest.com/docview/2771087232/abstract/