Supervised Application of Internal Validation Measures to Benchmark Dimensionality Reduction Methods in scRNA-seq Data

Abstract A typical single-cell RNA sequencing (scRNA-seq) experiment will measure on the order of 20,000 transcripts and thousands, if not millions, of cells. The high dimensionality of such data presents serious complications for traditional data analysis methods and, as such, methods to reduce dim...

Full description

Saved in:

Bibliographic Details
Published in	bioRxiv
Main Authors	Koch, rest C, Sutton, Gavin J, Voineagu, Irina, Vafaee, Fatemeh
Format	Paper
Language	English
Published	Cold Spring Harbor Cold Spring Harbor Laboratory Press 01.11.2020
Subjects	Algorithms Clustering Computer applications Embedding
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Abstract A typical single-cell RNA sequencing (scRNA-seq) experiment will measure on the order of 20,000 transcripts and thousands, if not millions, of cells. The high dimensionality of such data presents serious complications for traditional data analysis methods and, as such, methods to reduce dimensionality play an integral role in many analysis pipelines. However, few studies benchmark the performance of these methods on scRNA-seq data, with existing comparisons assessing performance via downstream analysis accuracy measures which may confound the interpretation of their results. Here, we present the most comprehensive benchmark of dimensionality reduction methods in scRNA-seq data to date, utilizing over 300,000 compute hours to assess the performance of over 25,000 low dimension embeddings across 33 dimensionality reduction methods and 55 scRNA-seq datasets (ranging from 66-27,500 cells). We employ a simple-yet-novel approach which does not rely on the results of downstream analyses. Internal validation measures (IVMs), traditionally used as an unsupervised method to assess clustering performance, are repurposed to measure how well-formed biological clusters are after dimensionality reduction. Performance was further evaluated using nearly 200,000,000 iterations of DBSCAN, a density-based clustering algorithm, showing that hyperparameter optimization using IVMs as the objective function leads to near-optimal clustering. Methods were also assessed on the extent to which they preserve the global structure of the data, and on their computational memory and time requirements across a large range of sample sizes. Our comprehensive benchmarking analysis provides a valuable resource for researchers and aims to guide best practice for dimensionality reduction in scRNA-seq analyses, and we highlight LDA (Latent Dirichlet Allocation) and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) as high-performing algorithms. Competing Interest Statement The authors have declared no competing interest. Footnotes * - Author names and institutions were revised - License changed to CC-NC-ND * https://github.com/ForrestCKoch/scRNA-Dimensionality-Reduction
DOI:	10.1101/2020.10.29.361451