Assessing disclosure risks for synthetic data with arbitrary intruder knowledge
Published in | Statistical journal of the IAOS, Vol. 32, no. 1, pp. 109-126 |
---|---|
Main Authors | McClure, David; Reiter, Jerome P. |
Format | Journal Article |
Language | English |
Published | London, England: SAGE Publications, 27.02.2016 |
ISSN | 1874-7655; EISSN 1875-9254 |
DOI | 10.3233/SJI-160957 |
Abstract | Several statistical agencies release synthetic microdata, i.e., data with all confidential values replaced with draws from statistical models, in order to protect data subjects' confidentiality. While fully synthetic data are safe from record linkage attacks, intruders might be able to use the released synthetic values to estimate confidential values for individuals in the collected data. We demonstrate and investigate this potential risk using two simple but informative scenarios: a single continuous variable possibly with outliers, and a three-way contingency table possibly with small counts in some cells. Beginning with the case that the intruder knows all but one value in the confidential data, we examine the effect on risk of decreasing the number of observations the intruder knows beforehand. We generally find that releasing synthetic data (1) can pose little risk to records in the middle of the distribution, and (2) can pose some risks to extreme outliers, although arguably these risks are mild. We also find that the effect of removing observations from an intruder's background knowledge heavily depends on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and drops quickly if he/she cannot. |
Author | McClure, David (Department of Statistical Science); Reiter, Jerome P. (Department of Statistical Science) |
Copyright | IOS Press and the authors. All rights reserved |
Discipline | Statistics |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | synthetic disclosure risk Confidentiality |
PageCount | 18 |
URI | https://journals.sagepub.com/doi/full/10.3233/SJI-160957 |