ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design
Published in | ACS Synthetic Biology, Vol. 12, no. 12, pp. 3544-3561 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | United States: American Chemical Society, 15.12.2023 |
ISSN | 2161-5063 |
DOI | 10.1021/acssynbio.3c00261 |
Summary: | Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach; they typically lack a low-dimensional latent embedding but do not require training sequences to be aligned into an MSA, and they enable the design of variable-length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information-maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms, enabling training over unaligned sequence data and the conditional generative design of variable-length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model’s ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker’s yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues. |
---|---|
Author Contributions: | N.P., X.L., R.R., and A.L.F. designed the research. R.R. and A.L.F. supervised the research. N.P. conceptualized, developed, and deployed the machine learning models. N.P. and X.L. conducted the experimental studies. N.P., X.L., and A.L.F. analyzed the data, wrote the paper, and revised the paper. |