Heavy-Tailed Kernels Reveal a Finer Cluster Structure in t-SNE Visualisations

T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the ‘crowding problem’ of SNE. Here, we develo...

Full description

Saved in:

Bibliographic Details
Published in	Machine Learning and Knowledge Discovery in Databases Vol. 11906; pp. 124 - 139
Main Authors	Kobak, Dmitry, Linderman, George, Steinerberger, Stefan, Kluger, Yuval, Berens, Philipp
Format	Book Chapter Journal Article
Language	English
Published	Cham Springer International Publishing 2020
Series	Lecture Notes in Computer Science
Subjects	Data visualisation Dimensionality reduction t-SNE t-SNE dimensionality reduction data visualisation
Online Access	Get full text

Cover

Loading…

More Information
Summary:	T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the ‘crowding problem’ of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom $$\nu $$ , with $$\nu \rightarrow \infty $$ corresponding to SNE and $$\nu =1$$ corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that $$\nu <1$$ can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.
Bibliography:	Electronic supplementary materialThe online version of this chapter (10.1007/978-3-030-46150-8_8) contains supplementary material, which is available to authorized users. Original Abstract: T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the ‘crowding problem’ of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document}, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu \rightarrow \infty $$\end{document} corresponding to SNE and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu =1$$\end{document} corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu <1$$\end{document} can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data. The original version of this chapter was revised: The supplementary file and its link has been added. The correction to this chapter is available at 10.1007/978-3-030-46150-8_44
ISBN:	9783030461492 3030461491
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-46150-8_8