Sparse Canonical Correlation Analysis with Application to Genomic Data Integration

Bibliographic Details
Published in	Statistical Applications in Genetics and Molecular Biology Vol. 8; no. 1; pp. 1 - 34
Main Authors	Parkhomenko, Elena; Tritchler, David; Beyene, Joseph
Format	Journal Article
Language	English
Published	Germany bepress 01.01.2009 De Gruyter
Subjects	Algorithms; canonical correlation; data integration; Genomics - statistics & numerical data; Humans; Models, Statistical; Sample Size; sparseness
Summary:	Large-scale genomic studies with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis, a common way to inspect the relationship between two sets of variables based on their correlation is canonical correlation analysis, which determines linear combinations of all variables of each type with maximal correlation between the two linear combinations. However, in high-dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate estimates of parameters, and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e., obtaining sparse loadings in the linear combinations of variables of each type. In this paper we present Sparse Canonical Correlation Analysis (SCCA), which examines the relationships between two types of variables and provides sparse solutions that include only small subsets of variables of each type, by maximizing the correlation between the subsets of variables of different types while performing variable selection. We also present an extension of SCCA, adaptive SCCA. We evaluate their properties using simulated data and illustrate practical use by applying both methods to the study of natural variation in human gene expression.
Submitted: July 31, 2008 · Accepted: November 29, 2008 · Published: January 6, 2009
Recommended Citation: Parkhomenko, Elena; Tritchler, David; and Beyene, Joseph (2009) "Sparse Canonical Correlation Analysis with Application to Genomic Data Integration," Statistical Applications in Genetics and Molecular Biology: Vol. 8 : Iss. 1, Article 1. DOI: 10.2202/1544-6115.1406. Available at: http://www.bepress.com/sagmb/vol8/iss1/art1
Bibliography:	istex:D09275471E6DB592540A26646D61065571E013D7; ArticleID:1544-6115.1406; sagmb.2009.8.1.1406.pdf; ark:/67375/QT4-MFBTMSB1-P
ISSN:	1544-6115
DOI:	10.2202/1544-6115.1406
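
The summary describes SCCA only at a high level: find sparse loading vectors for two sets of variables so that the resulting linear combinations are maximally correlated. As a rough illustration of that idea, the following is a minimal sketch of one generic way to obtain sparse canonical loadings, by alternating soft-thresholded updates on the sample cross-correlation matrix. It is not taken from the record and should not be read as the authors' exact algorithm. The function names `sparse_cca` and `soft_threshold`, the relative threshold parameters `frac_u` and `frac_v`, and the simplification of treating the within-set covariance matrices as identity are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam; entries with |a_i| <= lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _unit(a):
    """Scale a to unit Euclidean norm (returned unchanged if it is all zeros)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, frac_u=0.5, frac_v=0.5, n_iter=200, tol=1e-8):
    """Sketch: first pair of sparse canonical loadings (u for X, v for Y).

    frac_u and frac_v (hypothetical parameters, in [0, 1)) set the
    soft-threshold as a fraction of the largest absolute entry of each
    update, so at least one loading always stays nonzero.
    Assumes X and Y have the same number of rows and no constant columns.
    """
    # Standardize columns so that X.T @ Y / n is the sample cross-correlation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = X.T @ Y / X.shape[0]  # assumption: within-set covariances ~ identity

    # Start from the leading singular vectors of K (the non-sparse solution).
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        a = K @ v
        u_new = _unit(soft_threshold(a, frac_u * np.abs(a).max()))
        b = K.T @ u_new
        v_new = _unit(soft_threshold(b, frac_v * np.abs(b).max()))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v


if __name__ == "__main__":
    # Toy data: one latent signal drives 3 of 30 X-columns and 2 of 20 Y-columns.
    rng = np.random.default_rng(0)
    z = rng.standard_normal((200, 1))
    X = rng.standard_normal((200, 30))
    Y = rng.standard_normal((200, 20))
    X[:, :3] = z + 0.5 * X[:, :3]
    Y[:, :2] = z + 0.5 * Y[:, :2]
    u, v = sparse_cca(X, Y)
    print("selected X variables:", np.flatnonzero(u))
    print("selected Y variables:", np.flatnonzero(v))
```

Setting `frac_u = frac_v = 0` removes the thresholding, so the loop reduces to power iteration on the cross-correlation matrix and no variables are dropped. Larger fractions select fewer variables. In practice the amount of sparsity would need to be tuned, for example by cross-validation.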