Nearly Optimal Rates of Privacy-Preserving Sparse Generalized Eigenvalue Problem
Published in | IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 8, pp. 4101-4115 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | New York: IEEE, 01.08.2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Summary: | In this article, we study the (sparse) generalized eigenvalue problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA), and sliced inverse regression (SIR). We provide the first study of GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low-dimensional case, we provide a $\rho$-Concentrated DP (CDP) method, namely DP-Rayleigh Flow, and show that if the initial vector is close enough to the optimal vector, its output has an $\ell_{2}$-norm estimation error of $\tilde{O}(\frac{d}{n}+\frac{d}{n^{2}\rho})$ (under some mild assumptions), where $d$ is the dimension and $n$ is the sample size. Next, we discuss how to find such an initial parameter privately. In the high-dimensional sparse case where $d\gg n$, we propose the DP-Truncated Rayleigh Flow method, whose output can achieve an error of $\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^{2}\rho})$ for various statistical models, where $s$ is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of $\text{Poly}(\log n)$ by providing lower bounds for PCA and SIR under the statistical setting and in the CDP model. To give a separation between $\epsilon$-DP and $\rho$-CDP for GEP, we also provide the lower bounds $\Omega(\frac{d}{n}+\frac{d^{2}}{n^{2}\epsilon^{2}})$ and $\Omega(\frac{s\log d}{n}+\frac{s^{2}\log^{2}d}{n^{2}\epsilon^{2}})$ on the private minimax risk for PCA, under the statistical setting and the $\epsilon$-DP model, in the low-dimensional and high-dimensional sparse cases, respectively. Finally, extensive experiments on both synthetic and real-world data support our theoretical analysis. |
---|---|
ISSN: | 1041-4347 1558-2191 |
DOI: | 10.1109/TKDE.2023.3330775 |
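
The summary names DP-Rayleigh Flow and DP-Truncated Rayleigh Flow but does not give their update rules, where the noise is injected, or how it is calibrated. The block below is therefore only a minimal sketch of the general idea, not the authors' algorithm. It assumes the standard (non-private) Rayleigh-flow step for the pair $(A, B)$, adds Gaussian noise to each step, and optionally keeps only the largest-magnitude coordinates. The names `noisy_rayleigh_flow`, `gaussian_sigma_zcdp`, `eta`, `sigma`, and `sparsity` are hypothetical. The helper uses the standard fact that Gaussian noise with standard deviation $\Delta/\sqrt{2\rho}$ gives $\rho$-zCDP for a query with $\ell_2$-sensitivity $\Delta$. The sketch does not bound that sensitivity, so it carries no privacy guarantee as written.

```python
import numpy as np


def gaussian_sigma_zcdp(l2_sensitivity, rho):
    """Noise std for the Gaussian mechanism under rho-zCDP: Delta / sqrt(2 * rho)."""
    return l2_sensitivity / np.sqrt(2.0 * rho)


def noisy_rayleigh_flow(A, B, v0, eta, steps, sigma=0.0, sparsity=None, rng=None):
    """Illustrative noisy Rayleigh-quotient ascent for max v^T A v / v^T B v.

    Assumptions (not taken from the abstract): A, B are symmetric, B is
    positive definite, the Rayleigh quotient stays positive, and `sigma`
    is supplied by the caller after bounding the step's sensitivity.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        quotient = (v @ A @ v) / (v @ B @ v)
        # Rayleigh-flow step: move along (A - quotient * B) v.
        step = v + (eta / quotient) * (A @ v - quotient * (B @ v))
        # Hypothetical privatization: isotropic Gaussian noise on the step.
        step = step + sigma * rng.standard_normal(v.shape)
        if sparsity is not None:
            # Truncation: keep the `sparsity` largest-magnitude entries.
            keep = np.argsort(np.abs(step))[-sparsity:]
            truncated = np.zeros_like(step)
            truncated[keep] = step[keep]
            step = truncated
        v = step / np.linalg.norm(step)
    return v
```

With `sigma=0.0` and `sparsity=None` this reduces to a plain Rayleigh-flow iteration. If the total budget $\rho$ is split evenly over `steps` iterations, zCDP composition suggests calling `gaussian_sigma_zcdp(delta, rho / steps)` per iteration, where `delta` is a sensitivity bound the caller must establish separately, for example by clipping the data.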