Robust and Scalable Content-and-Structure Indexing

Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer...

Bibliographic Details
Published in The VLDB journal
Main Authors Wellenzohn, Kevin, Böhlen, Michael H., Helmer, Sven, Pietri, Antoine, Zacchiroli, Stefano
Format Journal Article
Language English
Published Springer 2022
Abstract Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest publicly available source code archive.
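To make concrete what a CAS query asks for, the following is a minimal sketch that evaluates one by a plain linear scan. It is not the RSCAS index and not the authors' code: the names `Item`, `pattern_to_regex`, and `cas_query` are hypothetical, and the pattern syntax is an assumption chosen for illustration (`*` matches within a single path label, `**` stands in for a descendant axis and may span several labels).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    path: str   # location in the hierarchy, e.g. "/src/lib/util.c"
    value: int  # attribute value, e.g. a timestamp


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate the assumed pattern syntax into a regular expression."""
    parts = []
    # Split on wildcards but keep them as tokens; '**' is tried before '*'.
    for token in re.split(r"(\*\*|\*)", pattern):
        if token == "**":
            parts.append(".*")       # any number of labels (descendant axis)
        elif token == "*":
            parts.append("[^/]*")    # any characters inside one label
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


def cas_query(items, pattern, lo, hi):
    """Return items whose path matches the pattern and whose value is in [lo, hi]."""
    regex = pattern_to_regex(pattern)
    return [
        item
        for item in items
        if regex.fullmatch(item.path) and lo <= item.value <= hi
    ]
```

A scan like this inspects every item regardless of how selective the path or value predicate is. An index over composite path-and-value keys, as the abstract describes, aims to avoid that cost.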
Author Zacchiroli, Stefano
Böhlen, Michael H.
Helmer, Sven
Wellenzohn, Kevin
Pietri, Antoine
Author_xml – sequence: 1
  givenname: Kevin
  surname: Wellenzohn
  fullname: Wellenzohn, Kevin
  organization: Department of Informatics [Zurich]
– sequence: 2
  givenname: Michael H.
  surname: Böhlen
  fullname: Böhlen, Michael H.
  organization: Department of Informatics [Zurich]
– sequence: 3
  givenname: Sven
  surname: Helmer
  fullname: Helmer, Sven
  organization: Department of Informatics [Zurich]
– sequence: 4
  givenname: Antoine
  surname: Pietri
  fullname: Pietri, Antoine
  organization: Direction générale déléguée à l'innovation
– sequence: 5
  givenname: Stefano
  orcidid: 0000-0002-4576-136X
  surname: Zacchiroli
  fullname: Zacchiroli, Stefano
  organization: Laboratoire Traitement et Communication de l'Information
BackLink https://hal.science/hal-03787268 (View record in HAL)
ContentType Journal Article
Copyright Distributed under a Creative Commons Attribution 4.0 International License
Copyright_xml – notice: Distributed under a Creative Commons Attribution 4.0 International License
DBID 1XC
VOOES
DatabaseName Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 0949-877X
ExternalDocumentID oai_HAL_hal_03787268v1
ISSN 1066-8888
IngestDate Thu Oct 24 06:48:57 EDT 2024
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
License Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
LinkModel OpenURL
ORCID 0000-0002-4576-136X
OpenAccessLink https://hal.science/hal-03787268
ParticipantIDs hal_primary_oai_HAL_hal_03787268v1
PublicationCentury 2000
PublicationDate 2022
PublicationDateYYYYMMDD 2022-01-01
PublicationDate_xml – year: 2022
  text: 2022
PublicationDecade 2020
PublicationTitle The VLDB journal
PublicationYear 2022
Publisher Springer
Publisher_xml – name: Springer
SSID ssj0002225
Score 4.7143836
SourceID hal
SourceType Open Access Repository
SubjectTerms Computer Science
Software Engineering
Title Robust and Scalable Content-and-Structure Indexing
URI https://hal.science/hal-03787268