Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

Bibliographic Details
Published in	Bioinformatics (Oxford, England), Vol. 38, no. 5, pp. 1470-1472
Main Authors	Umer, Husen M; Audain, Enrique; Zhu, Yafeng; Pfeuffer, Julianus; Sachsenberg, Timo; Lehtiö, Janne; Branca, Rui M; Perez-Riverol, Yasset
Format	Journal Article
Language	English
Published	England: Oxford University Press, 07.02.2022
Subjects	Algorithms; Applications Notes; Humans; Medical and Health Sciences; Peptides - genetics; Proteins; Proteogenomics; Software

More Information
Summary:	We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools generate protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also support exonic out-of-frame translation of otherwise canonical protein-coding mRNAs. Moreover, the tools generate variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD and mutations detected by sequencing patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation using the DecoyPyrat algorithm. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing non-canonical or cryptic peptides that amount to more than 5% of the total number of peptides identified. Availability and implementation: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. Supplementary information: Supplementary data are available at Bioinformatics online.
ISSN:	1367-4803, 1367-4811
DOI:	10.1093/bioinformatics/btab838
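
Illustrative note: the summary mentions a three-frame translation of non-canonical transcripts. The sketch below shows the general idea in plain Python, assuming the standard genetic code and forward-strand reading frames only. The function names `translate` and `three_frame_translation` are hypothetical and are not taken from pypgatk; the package's actual implementation and options may differ.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence in frame 0.

    Stop codons become '*', codons with ambiguous bases become 'X',
    and a trailing partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frame_translation("ATGGCCTAAGT")):
        print(frame, protein)
```

In a proteogenomics setting, each frame would typically be split at stop codons ('*') and the resulting segments filtered by length before being written to a FASTA database; those steps are omitted here.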