TCRCluster: a novel approach to T-cell receptor latent featurization and clustering using contrastive learning-guided two-stage variational autoencoders

Bibliographic Details
Published in	NAR genomics and bioinformatics Vol. 7; no. 2; p. lqaf065
Main Authors	Wan, Yat-Tsai Richie; Nielsen, Morten
Format	Journal Article
Language	English
Published	England: Oxford University Press, 01.06.2025
Subjects	Algorithms; Autoencoder; Cluster Analysis; Complementarity Determining Regions - genetics; Humans; Receptors, Antigen, T-Cell - genetics; Receptors, Antigen, T-Cell, alpha-beta - genetics
ISSN	2631-9268
DOI	10.1093/nargab/lqaf065

Summary:	T cells play a vital role in adaptive immunity by targeting pathogen-infected or cancerous cells, but predicting their specificity remains challenging. Encoding T-cell receptor (TCR) sequences into informative feature spaces is therefore crucial for advancing specificity prediction and downstream applications. For this, we developed a variational autoencoder (VAE)-based model trained on paired TCR α–β chain data, incorporating all six complementarity-determining regions. A semi-supervised ‘two-stage VAE’ framework, integrating cosine triplet loss and a classifier, was found to further refine peptide-specific latent representations, outperforming sequence-based methods in specificity prediction. Clustering analyses leveraging our VAE latent space were evaluated using K-means, agglomerative clustering, and a novel graph-based method. Agglomerative clustering achieved the most biologically relevant results, balancing cluster purity and retention despite noise in TCR specificity annotations. We extended these insights to evaluate TCR repertoire data. Across datasets, VAE-based models outperformed sequence-based methods, particularly in retention metrics, with notable improvements in the SARS-CoV-2 repertoire dataset. Moreover, the cancer repertoire analysis highlighted the generalizability of our approach, where the model displayed high performance despite minimal similarity between the training and test data. Collectively, these results demonstrate the potential of VAE-based latent representations to offer a robust framework for prediction, clustering, and repertoire analysis.
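
The summary names two ingredients that are easy to illustrate in isolation: a cosine triplet loss on latent vectors, and cluster purity and retention as evaluation metrics. The following is a minimal sketch only, not the authors' implementation. The margin value, the function names, and the exact metric definitions (retention as the fraction of TCRs placed in clusters of at least two members, purity as the fraction of retained TCRs that share their cluster's majority peptide label) are assumptions made for illustration and are not taken from the record.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pulling the positive closer to the anchor than the negative.

    The margin of 0.2 is a hypothetical default, not a value from the paper.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def purity_and_retention(cluster_ids, peptide_labels):
    """Compute (purity, retention) under the assumed definitions above."""
    members = {}
    for cluster_id, label in zip(cluster_ids, peptide_labels):
        members.setdefault(cluster_id, []).append(label)

    # Singleton clusters are treated as "not retained" (an assumption).
    kept = [labels for labels in members.values() if len(labels) > 1]
    n_kept = sum(len(labels) for labels in kept)
    retention = n_kept / len(cluster_ids)

    # Count the members that agree with their cluster's majority label.
    n_majority = sum(Counter(labels).most_common(1)[0][1] for labels in kept)
    purity = n_majority / n_kept if n_kept else 0.0
    return purity, retention


if __name__ == "__main__":
    # Toy latent vectors: the positive matches the anchor, the negative is orthogonal.
    print(cosine_triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    # Toy cluster assignments with hypothetical peptide labels "A", "B", "C".
    print(purity_and_retention([0, 0, 0, 1, 2, 2], ["A", "A", "B", "A", "C", "C"]))
```

In this toy setting, stricter clustering would typically raise purity while lowering retention, which is the trade-off the summary refers to when it describes agglomerative clustering as balancing the two.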