Assessing disclosure risks for synthetic data with arbitrary intruder knowledge

Bibliographic Details
Published in Statistical journal of the IAOS Vol. 32; no. 1; pp. 109-126
Main Authors McClure, David; Reiter, Jerome P.
Format Journal Article
Language English
Published London, England SAGE Publications 27.02.2016
ISSN 1874-7655
EISSN 1875-9254
DOI 10.3233/SJI-160957

Abstract Several statistical agencies release synthetic microdata, i.e., data with all confidential values replaced with draws from statistical models, in order to protect data subjects' confidentiality. While fully synthetic data are safe from record linkage attacks, intruders might be able to use the released synthetic values to estimate confidential values for individuals in the collected data. We demonstrate and investigate this potential risk using two simple but informative scenarios: a single continuous variable possibly with outliers, and a three-way contingency table possibly with small counts in some cells. Beginning with the case that the intruder knows all but one value in the confidential data, we examine the effect on risk of decreasing the number of observations the intruder knows beforehand. We generally find that releasing synthetic data (1) can pose little risk to records in the middle of the distribution, and (2) can pose some risks to extreme outliers, although arguably these risks are mild. We also find that the effect of removing observations from an intruder's background knowledge heavily depends on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and drops quickly if he/she cannot.
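The first scenario in the abstract (an intruder who knows all but one value of a single continuous variable) can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes, purely for illustration, that the agency draws synthetic values from a normal model with plug-in mean and standard deviation, and that the intruder places a uniform prior on a grid of candidate values for the unknown record. All function names, the grid, and the sample sizes are hypothetical.

```python
import math
import random
import statistics


def log_lik_synthetic(synthetic, data):
    """Log-likelihood (up to a constant) of the synthetic values under a
    normal model fit to `data` (assumed plug-in synthesizer)."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return sum(-math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2 for z in synthetic)


def posterior_over_candidates(known, synthetic, candidates):
    """Intruder's posterior over candidate values for the one unknown record,
    assuming a uniform prior over `candidates`."""
    logs = [log_lik_synthetic(synthetic, known + [c]) for c in candidates]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def demo(target, n=30, n_synth=30, seed=1):
    """Simulate one release: n - 1 known values plus one unknown `target`."""
    rng = random.Random(seed)
    known = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
    confidential = known + [target]
    mu = statistics.fmean(confidential)
    sigma = statistics.stdev(confidential)
    synthetic = [rng.gauss(mu, sigma) for _ in range(n_synth)]
    candidates = [c / 2 for c in range(-20, 21)]  # grid from -10 to 10, step 0.5
    return candidates, posterior_over_candidates(known, synthetic, candidates)
```

Calling demo with a target near zero and again with a target far from zero, then comparing how much posterior mass falls near the true value, gives a rough feel for the comparison between middle-of-the-distribution records and outliers that the abstract describes. The sketch makes no claim about the magnitudes reported in the article.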
Author McClure, David (Department of Statistical Science)
Author Reiter, Jerome P. (Department of Statistical Science)
Copyright IOS Press and the authors. All rights reserved
Discipline Statistics
IsPeerReviewed true
IsScholarly true
Keywords synthetic; disclosure; risk; confidentiality
URI https://journals.sagepub.com/doi/full/10.3233/SJI-160957