Online Algorithms with Limited Data Retention

Bibliographic Details
Main Authors: Immorlica, Nicole; Lucier, Brendan; Mobius, Markus; Siderius, James
Format: Journal Article
Language: English
Published: 16.04.2024
Subjects: Computer Science - Data Structures and Algorithms; Computer Science - Learning
Online Access: https://arxiv.org/abs/2404.10997
DOI: 10.48550/arxiv.2404.10997
Copyright: http://creativecommons.org/licenses/by/4.0

Abstract
We introduce a model of online algorithms subject to strict constraints on data retention. An online learning algorithm encounters a stream of data points, one per round, generated by some stationary process. Crucially, each data point can request that it be removed from memory $m$ rounds after it arrives. To model the impact of removal, we do not allow the algorithm to store any information or calculations between rounds other than a subset of the data points (subject to the retention constraints). At the conclusion of the stream, the algorithm answers a statistical query about the full dataset. We ask: what level of performance can be guaranteed as a function of $m$? We illustrate this framework for multidimensional mean estimation and linear regression problems. We show it is possible to obtain an exponential improvement over a baseline algorithm that retains all data as long as possible. Specifically, we show that $m = \textsc{Poly}(d, \log(1/\epsilon))$ retention suffices to achieve mean squared error $\epsilon$ after observing $O(1/\epsilon)$ $d$-dimensional data points. This matches the error bound of the optimal, yet infeasible, algorithm that retains all data forever. We also show a nearly matching lower bound on the retention required to guarantee error $\epsilon$. One implication of our results is that data retention laws are insufficient to guarantee the right to be forgotten even in a non-adversarial world in which firms merely strive to (approximately) optimize the performance of their algorithms. Our approach makes use of recent developments in the multidimensional random subset sum problem to simulate the progression of stochastic gradient descent under a model of adversarial noise, which may be of independent interest.
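
The retention model in the abstract can be made concrete with a small sketch. The Python snippet below is a minimal illustration, not the paper's method: it implements only the naive baseline that retains each data point for the maximum allowed $m$ rounds and then answers a mean query from whatever remains in memory, in one dimension. The paper's actual algorithm, which uses multidimensional random subset sum to simulate stochastic gradient descent, is not reproduced here; the Gaussian stream and all names below are illustrative assumptions.

    import random

    def stream_mean_with_retention(stream, m):
        """Naive baseline: keep each point for at most m rounds after it
        arrives, then answer the mean query from the points still held.
        Per the model, only raw data points are stored between rounds --
        no running sums or other derived state."""
        memory = []  # list of (arrival_round, point); the only carried state
        for t, x in enumerate(stream):
            # Drop every point whose m-round retention window has expired.
            memory = [(s, y) for (s, y) in memory if t - s < m]
            memory.append((t, x))
        # Statistical query at the conclusion of the stream: the mean.
        points = [y for (_, y) in memory]
        return sum(points) / len(points)

    if __name__ == "__main__":
        random.seed(0)
        T, m = 10_000, 100
        stream = [random.gauss(0.0, 1.0) for _ in range(T)]  # stationary source
        est = stream_mean_with_retention(stream, m)
        # With only the last m of T points retained, the squared error of this
        # baseline scales like 1/m rather than the 1/T achievable with
        # unbounded retention -- the gap the paper's Poly(d, log(1/eps))-
        # retention algorithm closes.
        print(f"estimate from {m} retained points: {est:.4f}")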