Spatial coding-based approach for partitioning big spatial data in Hadoop
| Published in | Computers & Geosciences, Vol. 106, pp. 60–67 |
| --- | --- |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.09.2017 |
Summary: Spatial data partitioning (SDP) plays an important role in the distributed storage and parallel processing of spatial data. However, the skewed distribution of spatial data and the varying sizes of spatial vector objects make it challenging to achieve both optimal spatial-operation performance and data balance across the cluster. To tackle this problem, we propose a spatial coding-based approach for partitioning big spatial data in Hadoop. The approach first compresses the whole spatial dataset, based on a spatial coding matrix, into a sensing information set (SIS) that records the spatial code, size, count, and other information. The SIS is then used to build a spatial partitioning matrix, which finally splits all spatial objects into partitions across the cluster. With this approach, neighbouring spatial objects can be placed in the same block, while data skew in the Hadoop distributed file system (HDFS) is minimized. In a case study, the presented approach is compared against random-sampling-based partitioning using three criteria: spatial index quality, data skew in HDFS, and range query performance. The experimental results show that the spatial coding technique improves both the query performance of big spatial data and the data balance in HDFS. We implemented and deployed the approach in Hadoop, and it can also efficiently support other distributed big spatial data systems.
Highlights:
• A spatial coding-based approach (SCA) for partitioning big spatial data was presented.
• A Hilbert coding-based approach (HCA) was implemented over MapReduce.
• A comparative experimental study of spatial partitioning, contrasting random sampling with HCA, was conducted.
• Five real datasets and three measurement criteria were used in the tests, with good results.
ISSN: 0098-3004, 1873-7803
DOI: 10.1016/j.cageo.2017.05.014
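The summary and highlights describe the Hilbert coding-based workflow only at a high level. The sketch below (not the authors' implementation, which runs over MapReduce in Hadoop) illustrates the general idea under simplifying assumptions: each object is reduced to a centroid on a 2^k × 2^k grid, per-cell statistics stand in for the sensing information set, and contiguous Hilbert-code ranges are grouped greedily into roughly size-balanced partitions. All names (`xy_to_hilbert`, `build_sis`, `build_partitions`) and the data layout are hypothetical.

```python
# A minimal sketch of Hilbert coding-based spatial partitioning, assuming a
# simplified in-memory setting; not the paper's MapReduce implementation.
from collections import defaultdict

def xy_to_hilbert(n, x, y):
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its
    distance along the Hilbert curve (standard iterative xy->d algorithm)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so lower-order bits are interpreted
        # in the reoriented sub-square.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def build_sis(objects, extent, n):
    """Stand-in for the sensing information set: per Hilbert code, the object
    count and total byte size of objects whose centroid falls in that cell."""
    minx, miny, maxx, maxy = extent
    cw, ch = (maxx - minx) / n, (maxy - miny) / n
    sis = defaultdict(lambda: {"count": 0, "size": 0})
    for obj in objects:  # obj: {"centroid": (x, y), "size": bytes}
        cx = min(int((obj["centroid"][0] - minx) / cw), n - 1)
        cy = min(int((obj["centroid"][1] - miny) / ch), n - 1)
        code = xy_to_hilbert(n, cx, cy)
        sis[code]["count"] += 1
        sis[code]["size"] += obj["size"]
    return sis

def build_partitions(sis, target_bytes):
    """Sweep cells in Hilbert order and start a new partition whenever the
    accumulated size would exceed the target (e.g. an HDFS block size)."""
    partitions, current, acc = [], [], 0
    for code in sorted(sis):
        cell_size = sis[code]["size"]
        if current and acc + cell_size > target_bytes:
            partitions.append(current)
            current, acc = [], 0
        current.append(code)
        acc += cell_size
    if current:
        partitions.append(current)
    return partitions  # each partition is a list of Hilbert codes

if __name__ == "__main__":
    # Tiny synthetic example on a 4 x 4 grid over the unit square.
    objs = [{"centroid": (0.10, 0.10), "size": 40},
            {"centroid": (0.20, 0.15), "size": 60},
            {"centroid": (0.90, 0.90), "size": 50},
            {"centroid": (0.85, 0.10), "size": 70}]
    sis = build_sis(objs, extent=(0, 0, 1, 1), n=4)
    print(build_partitions(sis, target_bytes=100))
```

Because the Hilbert curve preserves spatial locality, cells grouped into the same partition tend to be spatial neighbours while partition sizes stay close to the target, which is the combination of range-query performance and HDFS data balance that the summary emphasizes.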