Spatial coding-based approach for partitioning big spatial data in Hadoop
| Published in | Computers & Geosciences, Vol. 106, pp. 60–67 |
| --- | --- |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.09.2017 |
Summary: Spatial data partitioning (SDP) plays an important role in the distributed storage and parallel processing of spatial data. However, the skewed distribution of spatial data and the varying sizes of spatial vector objects make it challenging to achieve both optimal spatial-operation performance and data balance across the cluster. To tackle this problem, we propose a spatial coding-based approach for partitioning big spatial data in Hadoop. The approach first compresses the whole spatial dataset, based on a spatial coding matrix, into a sensing information set (SIS) that records the spatial code, size, count, and other information. The SIS is then used to build a spatial partitioning matrix, which finally splits all spatial objects into partitions across the cluster. With this approach, neighbouring spatial objects can be placed in the same block, while data skew in the Hadoop distributed file system (HDFS) is minimized. In a case study, the presented approach is compared against random-sampling-based partitioning using three criteria: spatial index quality, data skew in HDFS, and range query performance. The experimental results show that the spatial coding technique improves both the query performance of big spatial data and the data balance in HDFS. We implemented and deployed the approach in Hadoop, and it can also efficiently support other distributed big spatial data systems.
Highlights:
• A spatial coding-based approach (SCA) for partitioning big spatial data was presented.
• A Hilbert coding-based approach (HCA) was implemented over MapReduce.
• A comparative experimental study of spatial partitioning, contrasting random sampling with HCA, was conducted.
• Five real datasets and three measurement criteria were used in the tests, with good results.
ISSN: 0098-3004, 1873-7803
DOI: 10.1016/j.cageo.2017.05.014
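The summary and highlights describe the Hilbert coding-based workflow only at a high level. The sketch below (not the authors' implementation, which runs over MapReduce in Hadoop) illustrates the general idea under simplifying assumptions: each object is reduced to a centroid on a 2^k × 2^k grid, per-cell statistics stand in for the sensing information set, and contiguous Hilbert-code ranges are grouped greedily into roughly size-balanced partitions. All names (`xy_to_hilbert`, `build_sis`, `build_partitions`) and the data layout are hypothetical.

```python
# A minimal sketch of Hilbert coding-based spatial partitioning, assuming a
# simplified in-memory setting; not the paper's MapReduce implementation.
from collections import defaultdict

def xy_to_hilbert(n, x, y):
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its
    distance along the Hilbert curve (standard iterative xy->d algorithm)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so lower-order bits are interpreted
        # in the reoriented sub-square.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def build_sis(objects, extent, n):
    """Stand-in for the sensing information set: per Hilbert code, the object
    count and total byte size of objects whose centroid falls in that cell."""
    minx, miny, maxx, maxy = extent
    cw, ch = (maxx - minx) / n, (maxy - miny) / n
    sis = defaultdict(lambda: {"count": 0, "size": 0})
    for obj in objects:  # obj: {"centroid": (x, y), "size": bytes}
        cx = min(int((obj["centroid"][0] - minx) / cw), n - 1)
        cy = min(int((obj["centroid"][1] - miny) / ch), n - 1)
        code = xy_to_hilbert(n, cx, cy)
        sis[code]["count"] += 1
        sis[code]["size"] += obj["size"]
    return sis

def build_partitions(sis, target_bytes):
    """Sweep cells in Hilbert order and start a new partition whenever the
    accumulated size would exceed the target (e.g. an HDFS block size)."""
    partitions, current, acc = [], [], 0
    for code in sorted(sis):
        cell_size = sis[code]["size"]
        if current and acc + cell_size > target_bytes:
            partitions.append(current)
            current, acc = [], 0
        current.append(code)
        acc += cell_size
    if current:
        partitions.append(current)
    return partitions  # each partition is a list of Hilbert codes

if __name__ == "__main__":
    # Tiny synthetic example on a 4 x 4 grid over the unit square.
    objs = [{"centroid": (0.10, 0.10), "size": 40},
            {"centroid": (0.20, 0.15), "size": 60},
            {"centroid": (0.90, 0.90), "size": 50},
            {"centroid": (0.85, 0.10), "size": 70}]
    sis = build_sis(objs, extent=(0, 0, 1, 1), n=4)
    print(build_partitions(sis, target_bytes=100))
```

Because the Hilbert curve preserves spatial locality, cells grouped into the same partition tend to be spatial neighbours while partition sizes stay close to the target, which is the combination of range-query performance and HDFS data balance that the summary emphasizes.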