Parallel Classification of Spatial Points Into Geographical Regions


Bibliographic Details
Published in: 2019 18th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 9-15
Main Authors: Tarmur, Sanver; Ozturan, Can
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2019
DOI: 10.1109/ISPDC.2019.000-3

Summary: The volume of data generated by social media, social networks, and distributed platforms such as blockchain has reached very high levels. Various data analysis methods can be applied to this big data. One such method is to classify geo-tagged social network data in order to report the geographical area associated with the data. We propose an efficient parallel classification approach and implement a classifier tool capable of processing huge amounts of data. To test our approach, we collect Twitter data over the five densest areas of Turkey. Important factors affecting classification performance include the spatial indexing and parallelization strategies. Hierarchical Triangular Mesh (HTM) and R-Tree spatial indexes are used for indexing regions. To process data streams in parallel, the classifier tool is built on the Apache Spark and Kafka platforms in order to obtain high scalability. To show the effectiveness of our method, we perform tests in the Amazon Web Services (AWS) cloud environment and compare it against a method that implements HTM on Microsoft SQL Server. Results show that a 1.6- to 4.5-fold speed-up is obtained and that Twitter data collected over a month can be processed in three hours.
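
The core task named in the summary, assigning a geo-tagged point to the region that contains it, can be illustrated with a minimal Python sketch. It is not the authors' implementation: a simple bounding-box pre-filter stands in for the HTM and R-Tree indexes, there is no Spark or Kafka parallelism, and the regions (`region_a`, `region_b`) and all function names are hypothetical placeholders rather than real boundaries of any area in Turkey.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


@dataclass
class Region:
    """A named polygon with a precomputed bounding box (hypothetical type)."""
    name: str
    polygon: List[Point]

    def __post_init__(self) -> None:
        xs = [p[0] for p in self.polygon]
        ys = [p[1] for p in self.polygon]
        # (min_x, min_y, max_x, max_y): a cheap filter standing in for a real index
        self.bbox = (min(xs), min(ys), max(xs), max(ys))


def in_bbox(pt: Point, bbox: Tuple[float, float, float, float]) -> bool:
    """Cheap rejection test applied before the exact polygon test."""
    x, y = pt
    return bbox[0] <= x <= bbox[2] and bbox[1] <= y <= bbox[3]


def in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count edge crossings of a ray going in the +x direction."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def classify(pt: Point, regions: List[Region]) -> Optional[str]:
    """Return the name of the first region containing pt, or None."""
    for region in regions:
        if in_bbox(pt, region.bbox) and in_polygon(pt, region.polygon):
            return region.name
    return None


# Hypothetical square regions used only for illustration.
REGIONS = [
    Region("region_a", [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]),
    Region("region_b", [(4.0, 4.0), (6.0, 4.0), (6.0, 6.0), (4.0, 6.0)]),
]

if __name__ == "__main__":
    for point in [(1.0, 1.0), (5.0, 5.0), (3.0, 3.0)]:
        print(point, classify(point, REGIONS))
```

Because each point is classified independently of the others, the work can be split across workers without coordination, which is the property a parallel approach such as the one described in the summary can exploit. A real spatial index would replace the linear scan over regions in `classify`.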