A descriptor-free machine learning framework to improve antigen discovery for bacterial pathogens

Bibliographic Details
Published in	PLOS ONE, Vol. 20, no. 6, p. e0323895
Main Authors	Podda, Marco; Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita; Sîrbu, Alina; Priami, Corrado; Brozzi, Alessandro
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 05.06.2025
Subjects	Amino acid sequence; Amino acids; Analysis; Antigens; Antigens - immunology; Artificial neural networks; Bacterial Infections - immunology; Bacterial Infections - prevention & control; Bacterial Infections - therapy; Bacterial Vaccines - immunology; Bioinformatics; Biology and Life Sciences; Clinical trials; Computational linguistics; Computer and Information Sciences; Computer applications; Computer Simulation; Dosage and administration; Drug Evaluation, Preclinical - methods; Engineering and Technology; Ethics; Experiments; Humans; Immune response; In vivo methods and tests; Laboratories; Language processing; Learning algorithms; Machine Learning; Medical research; Medicine and health sciences; Methods; Natural language interfaces; Neural networks; Pathogens; Protective Agents - therapeutic use; Proteins; Proteome - metabolism; Proteomes; Representations; Research and Analysis Methods; Subject Headings; Testing; Vaccine Development - methods; Vaccines; Vaccines, Subunit - therapeutic use; Vaccinology; Italy
Summary:	Identifying protective antigens (PAs), i.e., targets for bacterial vaccines, is challenging as conducting in-vivo tests at the proteome scale is impractical. Reverse Vaccinology (RV) aids in narrowing down the pool of candidates through computational screening of proteomes. Within RV, one prominent approach is to train Machine Learning (ML) models to classify PAs. These models can be used to score unseen protein sequences and assist researchers in selecting promising candidates. Traditionally, proteins are fed into these models as vectors of biological and physico-chemical descriptors derived from their residue sequences. However, this method relies on multiple third-party software packages, which may be unreliable, difficult to use, or no longer maintained. Furthermore, selecting descriptors is susceptible to biases. Hence, Protein Sequence Embeddings (PSEs), high-dimensional vectorial representations of protein sequences obtained from pretrained deep neural networks, have emerged as an alternative to descriptors, offering data-driven feature extraction and a streamlined computational pipeline. We introduce PSEs as a descriptor-free representation of protein sequences for ML in RV. We conducted a thorough comparison of PSE-based and descriptor-based pipelines for PA classification across 10 bacterial species evaluated independently. Our results show that the PSE-based pipeline, which leverages the FAIR ESM-2 protein language model, outperformed the descriptor-based pipeline in 9 out of 10 species, with a mean Area Under the Receiver Operating Characteristic curve (AUROC) of 0.875 versus 0.855. Additionally, it achieved superior performance on the iBPA benchmark (0.86 AUROC vs. 0.82) compared to other methods in the literature. Lastly, we applied the pipeline to rank unseen proteomes based on protective potential to guide candidate selection for pre-clinical testing. Compared to the standard RV practice of ranking candidates according to their biological descriptors, our approach reduces the number of pre-clinical tests needed to identify PAs by up to 83% on average.
Bibliography:	These authors also contributed equally to this work. Competing Interests: This research was commissioned by GSK. AB is employed by GSK. This does not alter our adherence to PLOS ONE policies on sharing data and materials. The authors declare no other financial and non-financial relationships and activities and no other conflicts of interest.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0323895