Nearly Optimal Rates of Privacy-Preserving Sparse Generalized Eigenvalue Problem

Bibliographic Details
Published in	IEEE transactions on knowledge and data engineering Vol. 36; no. 8; pp. 4101 - 4115
Main Authors	Hu, Lijie; Xiang, Zihang; Liu, Jiabin; Wang, Di
Format	Journal Article
Language	English
Published	New York IEEE 01.08.2024 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Correlation analysis; Differential privacy; Dimension reduction; Dimensional analysis; Dimensionality reduction; Discriminant analysis; Eigenvalues; Eigenvalues and eigenfunctions; Estimation error; generalized eigenvalue problem; Lower bounds; Parameters; Principal component analysis; Principal components analysis; Privacy; sliced inverse regression; Statistical analysis; Statistical models; Stochastic processes; Upper bound
Summary:	In this article, we study the (sparse) generalized eigenvalue problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA), and sliced inverse regression (SIR). We provide the first study of GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low-dimensional case, we provide a $\rho$-Concentrated DP (CDP) method, namely DP-Rayleigh Flow, and show that if the initial vector is close enough to the optimal vector, its output has an $\ell_{2}$-norm estimation error of $\tilde{O}(\frac{d}{n}+\frac{d}{n^{2}\rho})$ (under some mild assumptions), where $d$ is the dimension and $n$ is the sample size. Next, we discuss how to find such an initial parameter privately. In the high-dimensional sparse case where $d\gg n$, we propose the DP-Truncated Rayleigh Flow method, whose output can achieve an error of $\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^{2}\rho})$ for various statistical models, where $s$ is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of $\text{Poly}(\log n)$ by providing lower bounds for PCA and SIR under the statistical setting and in the CDP model. Finally, to give a separation between $\epsilon$-DP and $\rho$-CDP for GEP, we also provide the lower bounds $\Omega(\frac{d}{n}+\frac{d^{2}}{n^{2}\epsilon^{2}})$ and $\Omega(\frac{s\log d}{n}+\frac{s^{2}\log^{2}d}{n^{2}\epsilon^{2}})$ on the private minimax risk for PCA, under the statistical setting and the $\epsilon$-DP model, in the low-dimensional and high-dimensional sparse cases, respectively. Extensive experiments on both synthetic and real-world data support our theoretical analysis.
ISSN:	1041-4347 1558-2191
DOI:	10.1109/TKDE.2023.3330775
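
Illustrative note: the summary refers to Rayleigh-flow-style methods for the GEP, that is, finding a vector v that maximizes the generalized Rayleigh quotient (v^T A v) / (v^T B v). The sketch below is a minimal, hedged illustration of that idea and is not the article's DP-Rayleigh Flow or DP-Truncated Rayleigh Flow. It assumes A is symmetric positive definite and B is symmetric positive definite. The function names, the step size `eta`, the optional hard-thresholding parameter `sparsity`, and the noise level `noise_std` are all hypothetical. In particular, `noise_std` is not calibrated to any privacy parameter, so the code carries no privacy guarantee.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def rayleigh_quotient(A, B, v):
    """Generalized Rayleigh quotient (v^T A v) / (v^T B v)."""
    return dot(v, matvec(A, v)) / dot(v, matvec(B, v))


def noisy_rayleigh_flow(A, B, v0, eta=0.1, steps=200,
                        noise_std=0.0, sparsity=None, seed=0):
    """Ascent on the generalized Rayleigh quotient with optional noise.

    Assumptions (not taken from the article): A and B are symmetric
    positive definite; noise_std is an uncalibrated illustrative knob;
    sparsity, if given, is a positive integer number of entries to keep.
    """
    rng = random.Random(seed)
    v = normalize(v0)
    for _ in range(steps):
        r = rayleigh_quotient(A, B, v)
        Av, Bv = matvec(A, v), matvec(B, v)
        # Gradient-style step: v <- v + (eta / r) * (A v - r B v).
        v = [x + (eta / r) * (a - r * b) for x, a, b in zip(v, Av, Bv)]
        if noise_std > 0:
            # Hypothetical Gaussian perturbation of the iterate.
            v = [x + rng.gauss(0.0, noise_std) for x in v]
        if sparsity is not None:
            # Hard-threshold: keep only the largest-magnitude entries.
            order = sorted(range(len(v)), key=lambda i: abs(v[i]))
            keep = set(order[-sparsity:])
            v = [x if i in keep else 0.0 for i, x in enumerate(v)]
        v = normalize(v)
    return v


if __name__ == "__main__":
    A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
    B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    v = noisy_rayleigh_flow(A, B, [1.0, 1.0, 1.0])
    print(v, rayleigh_quotient(A, B, v))
```

With B set to the identity, as in the usage lines, the problem reduces to PCA on A, and the noiseless iterate should approach the leading eigenvector, with a Rayleigh quotient near the largest eigenvalue (3 here).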