Identifying Relevant Covariates in RNA-seq Analysis by Pseudo-Variable Augmentation

Bibliographic Details
Published in: Journal of Agricultural, Biological, and Environmental Statistics
Main Authors: Nguyen, Yet; Nettleton, Dan
Format: Journal Article
Language: English
Published: 02.11.2024
Summary: RNA-sequencing (RNA-seq) technology allows for the identification of differentially expressed genes, which are genes whose mean transcript abundance levels vary across conditions. In practice, RNA-seq datasets often include covariates that are of primary interest in addition to a set of covariates that are subject to selection. Some of these covariates may be relevant to gene expression levels, while others may be irrelevant. Ignoring relevant covariates or attempting to adjust for the effect of irrelevant covariates can compromise the identification of differentially expressed genes. To address this issue, we propose a variable selection method that uses pseudo-variables to control the expected proportion of selected covariates that are irrelevant. Our method accurately selects relevant covariates while keeping the false selection rate below a specified level. We demonstrate that our method outperforms existing methods for detecting differentially expressed genes when working with available covariates. Our method is implemented in a function of an R package, which is available at www.github.com/ntyet/csrnaseq. The analysis and simulation are available at www.github.com/ntyet/csrnaseq/tree/main/analysis.
ISSN: 1085-7117, 1537-2693
DOI: 10.1007/s13253-024-00665-3
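
Illustrative sketch: The summary describes selecting covariates by comparing them with pseudo-variables so that the false selection rate (FSR) stays below a target. The Python sketch below shows the general idea only and is not the authors' implementation. It assumes a single continuous response fitted by ordinary least squares (not RNA-seq counts across many genes), pseudo-variables built by permuting the rows of the candidate covariate matrix, and a simple plug-in FSR estimate (average number of pseudo-variables passing a threshold divided by one plus the number of real covariates passing it). The names abs_t_stats and select_with_pseudo_variables are hypothetical.

```python
import numpy as np


def abs_t_stats(design, y):
    """Absolute OLS t-statistics for each column of `design` (intercept added)."""
    n = design.shape[0]
    A = np.column_stack([np.ones(n), design])
    beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    # Drop the intercept; return one |t| per design column.
    return np.abs(beta[1:] / np.sqrt(np.diag(cov)[1:]))


def select_with_pseudo_variables(X, y, target_fsr=0.05, n_rep=50, seed=0):
    """Select columns of X whose scores beat pseudo-variables (hypothetical helper).

    Assumption: the FSR at a threshold is estimated as
    mean(#pseudo >= threshold) / (1 + #real >= threshold).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    real_scores = np.zeros(p)
    pseudo_scores = np.zeros((n_rep, p))
    for r in range(n_rep):
        # Row-permuted copy of X: irrelevant to y by construction.
        Z = X[rng.permutation(n), :]
        t = abs_t_stats(np.column_stack([X, Z]), y)  # fit on the augmented design
        real_scores += t[:p] / n_rep
        pseudo_scores[r] = t[p:]

    # Try thresholds from most liberal to most strict; keep the first that
    # meets the target estimated FSR.
    for thr in np.sort(real_scores):
        n_real = np.sum(real_scores >= thr)
        n_pseudo = np.mean(np.sum(pseudo_scores >= thr, axis=1))
        if n_pseudo / (1 + n_real) <= target_fsr:
            return np.flatnonzero(real_scores >= thr)
    return np.array([], dtype=int)


# Simulated example: only covariates 0 and 1 affect the response.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(size=300)
print(select_with_pseudo_variables(X_demo, y_demo))
```

Because the pseudo-variables are irrelevant by construction, the number of them passing a threshold serves as a rough stand-in for the number of irrelevant real covariates passing it. This plug-in estimate is a simplifying assumption of the sketch.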