Deep Semi-Supervised Learning Improves Universal Peptide Identification of Shotgun Proteomics Data

Bibliographic Details
Published in bioRxiv
Main Authors Halloran, John T; Urban, Gregor; Rocke, David; Baldi, Pierre
Format Paper
Language English
Published Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 09.12.2020
Summary: Semi-supervised machine learning post-processors critically improve peptide identification of shotgun proteomics data. Such post-processors accept the peptide-spectrum matches (PSMs) and feature vectors resulting from a database search, train a machine learning classifier, and recalibrate PSMs using the trained parameters, often yielding significantly more identified peptides across q-value thresholds. However, current state-of-the-art post-processors rely on shallow machine learning methods, such as support vector machines. In contrast, deep learning models have outperformed shallow models in an ever-growing number of other fields. In this work, we show that deep models significantly improve the recalibration of PSMs compared to the most accurate and widely used post-processors, such as Percolator and PeptideProphet. Furthermore, we show that deep learning is able to adaptively analyze complex datasets and features for more accurate universal post-processing, leading to both improved Prosit analysis and markedly better recalibration of recently developed database-search functions.

Competing Interest Statement: The authors have declared no competing interest.

Footnotes:
* Further results have been included to both quantify the amount of classification information available in MS/MS datasets and study how the amount of information affects resulting post-processing performance given deep/shallow machine learning models.
* http://jthalloran.ucdavis.edu/proteoTorchData.html
* https://github.com/proteoTorch/proteoTorch
DOI: 10.1101/2020.11.12.380881
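
Illustrative note: the summary measures post-processor quality by the number of peptides identified at a given q-value threshold. The sketch below shows one common way such q-values can be estimated from target and decoy PSM scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `tdc_q_values` and `count_accepted_targets` are hypothetical, higher scores are assumed to be better, and the simple estimator decoys/targets is assumed (some tools add a +1 correction or a target/decoy ratio factor).

```python
import numpy as np


def tdc_q_values(scores, is_target):
    """Estimate q-values from target/decoy PSM scores (higher score = better).

    Assumption: FDR at a score cutoff is estimated as
    (#decoys above cutoff) / (#targets above cutoff).
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)

    # Rank PSMs from best to worst score.
    order = np.argsort(-scores, kind="stable")
    targets = np.cumsum(is_target[order])
    decoys = np.cumsum(~is_target[order])

    # Estimated FDR when accepting every PSM down to each rank.
    fdr = decoys / np.maximum(targets, 1)

    # q-value: smallest FDR at this rank or at any more permissive cutoff.
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]

    # Map q-values back to the original PSM order and cap at 1.
    q = np.empty_like(q_sorted)
    q[order] = q_sorted
    return np.minimum(q, 1.0)


def count_accepted_targets(scores, is_target, threshold=0.01):
    """Count target PSMs whose q-value is at or below the threshold."""
    q = tdc_q_values(scores, is_target)
    return int(np.sum((q <= threshold) & np.asarray(is_target, dtype=bool)))
```

Under these assumptions, calling `count_accepted_targets` once with the raw database-search scores and once with the scores produced by a trained classifier gives the kind of before/after comparison the summary refers to when it mentions identifications across q-value thresholds.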