Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

Bibliographic Details
Published in 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), Vol. 30; pp. 553-561
Main Authors Geldenhuys, Morgan K., Pfister, Benjamin J. J., Scheinert, Dominik, Thamsen, Lauritz, Kao, Odej
Format Conference Proceeding, Journal Article
Language English
Published Polish Information Processing Society 01.01.2022
ISSN 2300-5963
DOI 10.15439/2022F225

Abstract Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results depends on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of the workloads on which jobs are expected to operate, static configurations often fail to meet Quality of Service constraints at low overhead. In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of cloud orchestration technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three consecutive phases which borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented a prototype of Khaos with Apache Flink and demonstrate its usefulness experimentally.
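The abstract names the goal of the third phase (continuously minimizing Quality of Service violations) but not a selection rule. As a minimal illustrative sketch, and not the authors' implementation, assume each failure-injection experiment from the second phase yields an average end-to-end latency and a recovery time for one candidate checkpoint interval. The names `ProfiledConfig`, `violation`, and `select_interval`, the two QoS limits, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfiledConfig:
    # Hypothetical record of one profiling experiment for a candidate configuration.
    checkpoint_interval_ms: int
    avg_latency_ms: float    # average end-to-end latency measured during the run
    recovery_time_ms: float  # time to recover after an injected failure


def violation(p: ProfiledConfig, max_latency_ms: float, max_recovery_ms: float) -> float:
    # Relative amount by which each (assumed) QoS limit is exceeded; 0.0 when both are met.
    lat = max(0.0, p.avg_latency_ms / max_latency_ms - 1.0)
    rec = max(0.0, p.recovery_time_ms / max_recovery_ms - 1.0)
    return lat + rec


def select_interval(profiles: List[ProfiledConfig], max_latency_ms: float, max_recovery_ms: float) -> int:
    # Prefer the smallest QoS violation; among ties, prefer the longest interval,
    # since checkpointing less often generally means less runtime overhead.
    best = min(
        profiles,
        key=lambda p: (violation(p, max_latency_ms, max_recovery_ms), -p.checkpoint_interval_ms),
    )
    return best.checkpoint_interval_ms


if __name__ == "__main__":
    # Made-up sample measurements, for illustration only.
    profiles = [
        ProfiledConfig(1_000, 120.0, 8_000.0),
        ProfiledConfig(5_000, 95.0, 20_000.0),
        ProfiledConfig(10_000, 90.0, 45_000.0),
    ]
    print(select_interval(profiles, max_latency_ms=100.0, max_recovery_ms=30_000.0))  # 5000
```

In this made-up example only the 5000 ms interval meets both limits: shorter intervals add latency, longer ones slow recovery. Re-running the selection whenever new measurements arrive is one way to approximate continuous optimization; the decision logic actually used by Khaos is not given in this record.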
Author_xml – sequence: 1
  givenname: Morgan K.
  surname: Geldenhuys
  fullname: Geldenhuys, Morgan K.
  email: morgan.k.geldenhuys@tu-berlin.de
  organization: Technische Universität, Berlin, Germany
– sequence: 2
  givenname: Benjamin J. J.
  surname: Pfister
  fullname: Pfister, Benjamin J. J.
  email: benjamin.j.j.pfister@tu-berlin.de
  organization: Technische Universität, Berlin, Germany
– sequence: 3
  givenname: Dominik
  surname: Scheinert
  fullname: Scheinert, Dominik
  email: dominik.scheinert@tu-berlin.de
  organization: Technische Universität, Berlin, Germany
– sequence: 4
  givenname: Lauritz
  surname: Thamsen
  fullname: Thamsen, Lauritz
  email: lauritz.thamsen@tu-berlin.de
  organization: University of Glasgow, United Kingdom
– sequence: 5
  givenname: Odej
  surname: Kao
  fullname: Kao, Odej
  email: odej.kao@tu-berlin.de
  organization: Technische Universität, Berlin, Germany
ContentType Conference Proceeding
Journal Article
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DOAJ Directory of Open Access Journals
Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9788396242396
8396242399
EISSN 2300-5963
EndPage 561
ExternalDocumentID oai_doaj_org_article_d1623bd0f11942adae161d35edbea5cd
9909140
Genre orig-research
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
OpenAccessLink https://doaj.org/article/d1623bd0f11942adae161d35edbea5cd
PageCount 9
ParticipantIDs doaj_primary_oai_doaj_org_article_d1623bd0f11942adae161d35edbea5cd
ieee_primary_9909140
PublicationCentury 2000
PublicationDate 2022-01-01
PublicationDateYYYYMMDD 2022-01-01
PublicationDate_xml – month: 01
  year: 2022
  text: 2022-01-01
  day: 01
PublicationDecade 2020
PublicationTitle 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS)
PublicationTitleAbbrev FedCSIS
PublicationYear 2022
Publisher Polish Information Processing Society
Publisher_xml – name: Polish Information Processing Society
SSID ssj0002140246
SourceID doaj
ieee
SourceType Open Website
Publisher
StartPage 553
SubjectTerms Chaos
Chaos Engineering
Cloud
Distributed Stream Processing
Fault tolerance
Fault tolerant systems
Parallel processing
Parallel Profiling
Probability
QoS Modeling
Quality of service
Runtime
Runtime Optimization
Title Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing
URI https://ieeexplore.ieee.org/document/9909140
https://doaj.org/article/d1623bd0f11942adae161d35edbea5cd
Volume 30
hasFullText 1
inHoldings 1
isFullTextHit
isPrint