Nearly Optimal Rates of Privacy-Preserving Sparse Generalized Eigenvalue Problem


Bibliographic Details
Published in IEEE Transactions on Knowledge and Data Engineering, Vol. 36, no. 8, pp. 4101-4115
Main Authors Hu, Lijie; Xiang, Zihang; Liu, Jiabin; Wang, Di
Format Journal Article
Language English
Published New York IEEE 01.08.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: In this article, we study the (sparse) generalized eigenvalue problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA), and sliced inverse regression (SIR). We provide the first study of GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low dimensional case, we provide a $\rho$-Concentrated DP (CDP) method, DP-Rayleigh Flow, and show that if the initial vector is close enough to the optimal vector, its output has an $\ell_{2}$-norm estimation error of $\tilde{O}(\frac{d}{n}+\frac{d}{n^{2}\rho})$ (under some mild assumptions), where $d$ is the dimension and $n$ is the sample size. Next, we discuss how to find such an initial parameter privately. In the high dimensional sparse case where $d\gg n$, we propose the DP-Truncated Rayleigh Flow method, whose output achieves an error of $\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^{2}\rho})$ for various statistical models, where $s$ is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of $\text{Poly}(\log n)$ by providing lower bounds for PCA and SIR under the statistical setting and in the CDP model. To give a separation between $\epsilon$-DP and $\rho$-CDP for GEP, we also provide the lower bounds $\Omega(\frac{d}{n}+\frac{d^{2}}{n^{2}\epsilon^{2}})$ and $\Omega(\frac{s\log d}{n}+\frac{s^{2}\log^{2}d}{n^{2}\epsilon^{2}})$ on the private minimax risk for PCA, under the statistical setting and the $\epsilon$-DP model, in the low dimensional and high dimensional sparse cases respectively. Finally, extensive experiments on both synthetic and real-world data support our theoretical analysis.
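The summary names the Rayleigh Flow iteration without stating it. As a rough illustration only, the sketch below shows a generic (truncated) Rayleigh-quotient ascent for the pair (A, B) with optional Gaussian perturbation at each step. It is not the article's algorithm: the function names, the `noise_std` parameter, and the step size are assumptions made for this example, the noise is not calibrated to any privacy budget, and the sketch assumes A is symmetric, B is symmetric positive definite, and the starting vector has a positive Rayleigh quotient.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, renormalize."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out / np.linalg.norm(out)

def noisy_rayleigh_flow(A, B, v0, eta, steps, noise_std=0.0, k=None, rng=None):
    """Illustrative sketch (hypothetical helper, not the published method).

    Ascends the generalized Rayleigh quotient v^T A v / v^T B v.
    noise_std is an assumed knob: it adds isotropic Gaussian noise to each
    update and carries NO privacy guarantee as written.
    If k is given, each iterate is truncated to its k largest entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        quotient = (v @ A @ v) / (v @ B @ v)
        # Gradient-style step: v + (eta / quotient) * (A - quotient * B) v
        step = v + (eta / quotient) * (A @ v - quotient * (B @ v))
        step = step + noise_std * rng.standard_normal(v.shape[0])
        v = step / np.linalg.norm(step)
        if k is not None:
            v = truncate(v, k)
    return v

# Toy problem: the generalized eigenvalues are 3, 0.5, and 0.125,
# so the leading generalized eigenvector is the first coordinate axis.
A = np.diag([3.0, 1.0, 0.5])
B = np.diag([1.0, 2.0, 4.0])
v_hat = noisy_rayleigh_flow(A, B, np.ones(3), eta=0.2, steps=200)
```

With `noise_std=0.0` this reduces to a plain non-private iteration; the step size 0.2 is chosen so that `eta` times the largest eigenvalue of B stays below 1 on this toy problem. Setting `k` gives the truncated variant suited to the sparse regime described above.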
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2023.3330775