SHC: Distributed Query Processing for Non-Relational Data Store


Bibliographic Details
Published in: 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1465-1476
Main Authors: Weiqing Yang, Mingjie Tang, Yongyang Yu, Yanbo Liang, Bikas Saha
Format: Conference Proceeding
Language: English
Published: IEEE, 01.04.2018
Summary: We introduce a simple data model for processing non-relational data with relational operations, and SHC (Apache Spark - Apache HBase Connector), an implementation of this model in the cluster computing framework Spark. SHC applies optimization techniques from relational data processing to a distributed, column-oriented key-value store (i.e., HBase). Compared to existing systems, SHC makes two major contributions. First, SHC offers a much tighter integration between relational query optimization and the non-relational data store, through a plug-in implementation that integrates with Spark SQL, a distributed in-memory computing engine for relational data. This design makes the system relatively easy to maintain and enables users to perform complex data analytics on top of a key-value store. Second, SHC leverages the Spark SQL Catalyst engine for high-performance query optimization and processing, e.g., data partition pruning, column pruning, predicate pushdown, and data locality. SHC has been deployed in multiple production environments with hundreds of nodes, where it provides efficient OLAP query processing over petabytes of data.
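The optimizations named in the summary (partition pruning, column pruning, predicate pushdown) can be illustrated with a toy sketch. This is not SHC, Spark, or HBase code: `ToyTable`, its `scan` method, the `cells_read` counter, and the sample data are all hypothetical names invented for illustration. The sketch only shows why pushing the key range, the column list, and the filter into the store reduces the amount of data read compared with scanning everything and filtering afterwards.

```python
# Toy illustration only: ToyTable and scan() are hypothetical,
# not part of SHC, Spark SQL, or HBase.
from bisect import bisect_left, bisect_right


class ToyTable:
    """A sorted key-value table: row key -> {column name: value}."""

    def __init__(self, rows):
        self.rows = rows
        self.keys = sorted(rows)   # row keys kept in sorted order
        self.cells_read = 0        # number of cells the store touched

    def scan(self, start=None, stop=None, columns=None, predicate=None):
        # Range (partition) pruning: only visit keys in [start, stop].
        lo = 0 if start is None else bisect_left(self.keys, start)
        hi = len(self.keys) if stop is None else bisect_right(self.keys, stop)
        for key in self.keys[lo:hi]:
            row = self.rows[key]
            # Predicate pushdown: evaluate the filter inside the store.
            if predicate is not None:
                col, test = predicate
                self.cells_read += 1
                if not test(row[col]):
                    continue
            # Column pruning: read only the requested columns.
            wanted = list(row) if columns is None else columns
            out = {}
            for c in wanted:
                self.cells_read += 1
                out[c] = row[c]
            yield key, out


# Assumed sample data: 100 rows with three columns each.
rows = {
    f"user{i:03d}": {"name": f"n{i}", "age": 20 + i % 50, "city": "X"}
    for i in range(100)
}

# Naive plan: read every cell, then filter and project on the client side.
naive = ToyTable(rows)
result_naive = [
    (k, {"name": r["name"]})
    for k, r in naive.scan()
    if "user010" <= k <= "user019" and r["age"] >= 35
]

# Optimized plan: push key range, column list, and filter into the scan.
opt = ToyTable(rows)
result = list(
    opt.scan("user010", "user019", ["name"], ("age", lambda a: a >= 35))
)
```

Both plans return the same rows, but comparing `naive.cells_read` with `opt.cells_read` shows how much less data the second plan touches. In a real deployment the equivalent decisions would be made by the query optimizer rather than written by hand.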
ISSN: 2375-026X
DOI: 10.1109/ICDE.2018.00165