Comparison of gene set scoring methods for reproducible evaluation of multiple tuberculosis gene signatures
Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of progression from infection to disease, and monitor TB treatment outcomes. However, an unresolved issue is whether gene set enrichment analysis (GSEA) of...
Published in | bioRxiv |
---|---|
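Main Authors | Wang, Xutao; VanValkenberg, Arthur; Odom-Mabey, Aubrey R; Ellner, Jerrold J; Hochberg, Natasha S; Salgame, Padmini; Patil, Prasad; Johnson, W Evan |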
Format | Journal Article Paper |
Language | English |
Published | United States, Cold Spring Harbor Laboratory Press, 30.01.2023 |
Subjects | |
Abstract | Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of progression from infection to disease, and monitor TB treatment outcomes. However, an unresolved issue is whether gene set enrichment analysis (GSEA) of the signature transcripts alone is sufficient for prediction and differentiation, or whether it is necessary to use the original statistical model created when the signature was derived. Intra-method comparison is complicated by the unavailability of original training data, missing details about the original trained model, and inadequate publicly-available software tools or source code implementing models. To facilitate these signatures' replicability and appropriate utilization in TB research, comprehensive comparisons between gene set scoring methods with cross-data validation of original model implementations are needed.
We compared the performance of 19 TB gene signatures across 24 transcriptomic datasets using both rebuilt original models and gene set scoring methods to evaluate whether gene set scoring is a reasonable proxy for the performance of the original trained model. We have provided an open-access software implementation of the original models for all 19 signatures for future use.
We considered existing gene set scoring and machine learning methods, including ssGSEA, GSVA, PLAGE, Singscore, and Zscore, as alternative approaches to profile gene signature performance. The sample-size-weighted mean area under the curve (AUC) value was computed to measure each signature's performance across datasets. Correlation analysis and Wilcoxon paired tests were used to compare the performance of the enrichment methods with that of the original models.
For many signatures, the predictions from gene set scoring methods were highly correlated and statistically equivalent to the results given by the original diagnostic models. PLAGE outperformed all other gene scoring methods. In some cases, PLAGE outperformed the original models when considering signatures' weighted mean AUC values and the AUC results within individual studies.
Gene set enrichment scoring of existing blood-based biomarker gene sets can distinguish patients with active TB disease from latent TB infection and other clinical conditions with equivalent or improved accuracy compared to the original methods and models. These data justify using gene set scoring methods of published TB gene signatures for predicting TB risk and treatment outcomes, especially when original models are difficult to apply or implement. |
---|---|
AbstractList | Rationale: Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of progression from infection to disease, and monitor TB treatment outcomes. However, an unresolved issue is whether gene set enrichment analysis (GSEA) of the signature transcripts alone is sufficient for prediction and differentiation, or whether it is necessary to use the original statistical model created when the signature was derived. Intra-method comparison is complicated by the unavailability of original training data, missing details about the original trained model, and inadequate publicly-available software tools or source code implementing models. To facilitate these signatures' replicability and appropriate utilization in TB research, comprehensive comparisons between gene set scoring methods with cross-data validation of original model implementations are needed. Objectives: We compared the performance of 19 TB gene signatures across 24 transcriptomic datasets using both rebuilt original models and gene set scoring methods to evaluate whether gene set scoring is a reasonable proxy for the performance of the original trained model. We have provided an open-access software implementation of the original models for all 19 signatures for future use. Methods: We considered existing gene set scoring and machine learning methods, including ssGSEA, GSVA, PLAGE, Singscore, and Zscore, as alternative approaches to profile gene signature performance. The sample-size-weighted mean area under the curve (AUC) value was computed to measure each signature's performance across datasets. Correlation analysis and Wilcoxon paired tests were used to compare the performance of the enrichment methods with that of the original models. Measurement and Main Results: For many signatures, the predictions from gene set scoring methods were highly correlated and statistically equivalent to the results given by the original diagnostic models. PLAGE outperformed all other gene scoring methods. In some cases, PLAGE outperformed the original models when considering signatures' weighted mean AUC values and the AUC results within individual studies. Conclusion: Gene set enrichment scoring of existing blood-based biomarker gene sets can distinguish patients with active TB disease from latent TB infection and other clinical conditions with equivalent or improved accuracy compared to the original methods and models. These data justify using gene set scoring methods of published TB gene signatures for predicting TB risk and treatment outcomes, especially when original models are difficult to apply or implement. Competing Interest Statement: The authors have declared no competing interest. |
Author | Salgame, Padmini VanValkenberg, Arthur Patil, Prasad Johnson, W Evan Wang, Xutao Ellner, Jerrold J Odom-Mabey, Aubrey R Hochberg, Natasha S |
Author_xml | – sequence: 1 givenname: Xutao surname: Wang fullname: Wang, Xutao organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA – sequence: 2 givenname: Arthur surname: VanValkenberg fullname: VanValkenberg, Arthur organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA – sequence: 3 givenname: Aubrey R surname: Odom-Mabey fullname: Odom-Mabey, Aubrey R organization: Division of Computational Biomedicine and Bioinformatics Program, Boston University, Boston, MA, USA – sequence: 4 givenname: Jerrold J surname: Ellner fullname: Ellner, Jerrold J organization: Department of Medicine, Center for Emerging Pathogens, Rutgers New Jersey Medical School, Newark, NJ, USA – sequence: 5 givenname: Natasha S surname: Hochberg fullname: Hochberg, Natasha S organization: Section of Infectious Diseases, Boston University School of Medicine, Boston, MA, USA – sequence: 6 givenname: Padmini surname: Salgame fullname: Salgame, Padmini organization: Department of Medicine, Center for Emerging Pathogens, Rutgers New Jersey Medical School, Newark, NJ, USA – sequence: 7 givenname: Prasad surname: Patil fullname: Patil, Prasad organization: Department of Biostatistics, Boston University, Boston, MA, USA – sequence: 8 givenname: W Evan surname: Johnson fullname: Johnson, W Evan organization: Division of Infectious Disease, Center for Data Science, Rutgers New Jersey Medical School, Newark, NJ, USA |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/36711818$$D View this record in MEDLINE/PubMed |
CitedBy_id | crossref_primary_10_3389_fimmu_2023_1210372 |
ContentType | Journal Article Paper |
Copyright | 2023. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
Copyright_xml | – notice: 2023. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
DBID | NPM 8FE 8FH AAFGM AAMXL ABOIG ABUWG ADZZV AFKRA AFLLJ AFOLM AGAJT AQTIP AZQEC BBNVY BENPR BHPHI CCPQU COVID DWQXO GNUQQ HCIFZ LK8 M7P PIMPY PQCXX PQEST PQQKQ PQUKI PRINS 7X8 |
DOI | 10.1101/2023.01.19.520627 |
DatabaseName | PubMed ProQuest SciTech Collection ProQuest Natural Science Collection ProQuest Central Korea - hybrid linking Natural Science Collection - hybrid linking Biological Science Collection - hybrid linking ProQuest Central (Alumni) ProQuest Central (Alumni) - hybrid linking ProQuest Central SciTech Premium Collection - hybrid linking ProQuest Central Student - hybrid linking ProQuest Central Essentials - hybrid linking ProQuest Women's & Gender Studies - hybrid linking ProQuest Central Essentials Biological Science Collection AUTh Library subscriptions: ProQuest Central ProQuest Natural Science Collection ProQuest One Community College Coronavirus Research Database ProQuest Central ProQuest Central Student SciTech Premium Collection (Proquest) (PQ_SDU_P3) Biological Sciences Biological Science Database Access via ProQuest (Open Access) ProQuest Central - hybrid linking ProQuest One Academic Eastern Edition (DO NOT USE) ProQuest One Academic ProQuest One Academic UKI Edition ProQuest Central China MEDLINE - Academic |
DatabaseTitle | PubMed Publicly Available Content Database ProQuest Central Student ProQuest Biological Science Collection ProQuest Central Essentials ProQuest One Academic Eastern Edition Coronavirus Research Database ProQuest Central (Alumni Edition) SciTech Premium Collection ProQuest One Community College ProQuest Natural Science Collection Biological Science Database ProQuest SciTech Collection ProQuest Central China ProQuest Central ProQuest One Academic UKI Edition Natural Science Collection ProQuest Central Korea Biological Science Collection ProQuest One Academic MEDLINE - Academic |
DatabaseTitleList | MEDLINE - Academic Publicly Available Content Database PubMed |
Database_xml | – sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: BENPR name: AUTh Library subscriptions: ProQuest Central url: https://www.proquest.com/central sourceTypes: Aggregation Database |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Biology |
EISSN | 2692-8205 |
ExternalDocumentID | 36711818 |
Genre | Preprint Working Paper/Pre-Print |
GrantInformation_xml | – fundername: NIGMS NIH HHS grantid: R01 GM127430 – fundername: NIAID NIH HHS grantid: R21 AI154387 |
GroupedDBID | 8FE 8FH AFKRA ALMA_UNASSIGNED_HOLDINGS BBNVY BENPR BHPHI HCIFZ LK8 M7P NPM NQS PIMPY PROAC RHI ABUWG AZQEC CCPQU COVID DWQXO GNUQQ PQEST PQQKQ PQUKI PRINS 7X8 |
ID | FETCH-LOGICAL-c2307-4fbc214e56c1e142418aa5eebf19da5c90b203aa690e05e9812d4d4a29084ebd3 |
IEDL.DBID | COVID |
ISSN | 2692-8205 |
IngestDate | Mon Jul 01 16:30:40 EDT 2024 Thu Oct 10 16:13:03 EDT 2024 Wed Oct 23 09:43:26 EDT 2024 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Keywords | gene signatures Tuberculosis gene set scoring methods reproducibility |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-c2307-4fbc214e56c1e142418aa5eebf19da5c90b203aa690e05e9812d4d4a29084ebd3 |
Notes | ObjectType-Article-2 SourceType-Scholarly Journals-1 ObjectType-Working Paper/Pre-Print-1 content type line 23 |
OpenAccessLink | https://proxy.k.utb.cz/login?url=https://www.proquest.com/docview/2767352388?pq-origsite=%requestingapplication% |
PMID | 36711818 |
PQID | 2767352388 |
PQPubID | 2050091 |
ParticipantIDs | proquest_miscellaneous_2771087232 proquest_journals_2767352388 pubmed_primary_36711818 |
PublicationCentury | 2000 |
PublicationDate | 2023-Jan-30 |
PublicationDateYYYYMMDD | 2023-01-30 |
PublicationDate_xml | – month: 01 year: 2023 text: 2023-Jan-30 day: 30 |
PublicationDecade | 2020 |
PublicationPlace | United States |
PublicationPlace_xml | – name: United States – name: Cold Spring Harbor |
PublicationTitle | bioRxiv |
PublicationTitleAlternate | bioRxiv |
PublicationYear | 2023 |
Publisher | Cold Spring Harbor Laboratory Press |
Publisher_xml | – name: Cold Spring Harbor Laboratory Press |
References | 38902649 - BMC Infect Dis. 2024 Jun 20;24(1):610. doi: 10.1186/s12879-024-09457-z |
References_xml | |
SSID | ssj0002961374 |
Score | 1.8753843 |
Snippet | Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of progression... Rationale: Many blood-based transcriptional gene signatures for tuberculosis (TB) have been developed with potential use to diagnose disease, predict risk of... |
SourceID | proquest pubmed |
SourceType | Aggregation Database Index Database |
SubjectTerms | Clinical outcomes Correlation analysis Gene set enrichment analysis Mathematical models Software Statistical analysis Transcriptomics Tuberculosis |
Title | Comparison of gene set scoring methods for reproducible evaluation of multiple tuberculosis gene signatures |
URI | https://www.ncbi.nlm.nih.gov/pubmed/36711818 https://www.proquest.com/docview/2767352388 https://www.proquest.com/docview/2771087232/abstract/ |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link.rule.ids | 315,786,790,21416,27955,27956,33777,33778,38549,43838,43928 |
linkProvider | ProQuest |
linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnV07b9swED609tBOSdCXGzdhga5yJT5kcsrg1nAeTTvERTaBj3NgxLBcS1r668uz5AQo0A6ZjyII6kh-vPv4HcAnPeaLiENFkgWNSUTgcc1xkyciaKOcktbjjiB7nc_m8uJW3XYBt6qjVe73xN1GHUpPMfLPfJyPI1gQWp9tfiVUNYqyq10JjefQjwcr1z3oT77_PP_SpS-ju9HlXpAwZ2ZGipMm77-h5O5ImR5AsR9MyyS5HzW1G_nff-k0Pn20h9D_YTe4PYJnuH4F95OHioOsXLDoN8gqrFnldxQ81paSrlgEsYyULkkIdulWyB71wOm7PQGR1Y3DrW9WZbWsut6Wd61MaPUa5tOvN5NZ0lVaSDwRwRO5cJ5nElXuM6S3b5m2ViG6RWaCVd6kjqfC2niVxlShiaggyCAtN6mW6IJ4A711ucZ3wEQqXTSFoDCaVGaNyG3qvBGWpPnlAIb7GSu65VIVj9M1gI8P5ujolL2waywbahPBUHQtwQfwtv1txaZV5ChEPqYHtPr9_zs_hRezm29XxdX59eUxvCTXoGiKSIfQq7cNfoj4onYnnRP9AdOu1EI |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Comparison+of+gene+set+scoring+methods+for+reproducible+evaluation+of+multiple+tuberculosis+gene+signatures&rft.jtitle=bioRxiv&rft.au=Wang%2C+Xutao&rft.au=VanValkenberg%2C+Arthur&rft.au=Odom-Mabey%2C+Aubrey+R&rft.au=Ellner%2C+Jerrold+J&rft.date=2023-01-30&rft.issn=2692-8205&rft.eissn=2692-8205&rft_id=info:doi/10.1101%2F2023.01.19.520627&rft.externalDBID=NO_FULL_TEXT |
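The Methods summary in the abstract refers to a sample-size-weighted mean AUC across datasets. The record does not give the formula, so the following is only a minimal sketch under stated assumptions: each dataset is weighted by its total number of samples, and AUC is taken as the probability that a positive sample scores higher than a negative one (ties counted as one half). The function names and the toy data are hypothetical and are not taken from the authors' software.

```python
from typing import List, Sequence, Tuple

Dataset = Tuple[Sequence[float], Sequence[int]]  # (scores, labels), label 1 = positive


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(positive score > negative score), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_mean_auc(datasets: List[Dataset]) -> float:
    """Mean AUC across datasets, weighted by sample count (assumed weighting)."""
    total = sum(len(labels) for _, labels in datasets)
    return sum(len(labels) * auc(scores, labels) for scores, labels in datasets) / total


# Hypothetical toy data: signature scores and disease labels for two studies.
study_a = ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
study_b = ([0.6, 0.4, 0.7, 0.2, 0.5, 0.1], [1, 1, 0, 0, 0, 0])
overall = weighted_mean_auc([study_a, study_b])
```

Weighting by sample count lets larger studies contribute more to the summary than small ones; other weightings, for example by the number of cases only, are possible and may differ from what the authors used.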