I/O efficient structural clustering and maintenance of clusters for large-scale graphs

Bibliographic Details
Published in Expert systems with applications Vol. 168; p. 114221
Main Authors Seo, Jung Hyuk; Kim, Myoung Ho
Format Journal Article
Language English
Published New York Elsevier Ltd 15.04.2021
Elsevier BV
Summary: In recent years, the size of graph data has increased significantly, but most existing graph clustering algorithms do not consider the case where main memory is too small to hold the graph. Exploring the entire graph for clustering then causes many random disk accesses to data that are not loaded in memory, resulting in excessive disk I/O and thrashing. To address this problem, we propose an I/O-efficient algorithm for structural clustering of a graph, called pm-SCAN. In the proposed method, if memory is insufficient, the input graph is partitioned into several subgraphs, each smaller than memory, and clustering is first performed on each subgraph. The clusters from the subgraphs are then merged based on the connectivity between clusters, so that the final result is global with respect to the original input graph. pm-SCAN scales to very large graphs, i.e., to cases where available memory is severely limited, and its result is the same as that of the original structural clustering algorithm SCAN. We also propose a cluster maintenance method for large-scale dynamic graphs that change over time. Instead of reclustering the whole graph, the method first identifies only the small set of nodes whose structural connectivity may change under a given update operation, then accesses only those nodes on disk and updates their clusters to reduce maintenance costs. This dynamic graph handling mechanism shows significant performance improvement over the existing method and over the baseline that performs clustering from scratch.
• We study clustering and cluster maintenance for graphs larger than main memory.
• Clustering is performed on several memory-sized subgraphs to avoid thrashing.
• Cluster membership of nodes is correctly updated for changes in dynamic graphs.
• Real-world graphs with billions of edges can be processed when memory is insufficient.
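The partition-then-merge idea in the summary can be illustrated with a small in-memory sketch. This is not the authors' pm-SCAN: it does not model disk I/O, and it groups only core nodes, leaving out border nodes, hubs, and outliers. It assumes the standard SCAN definitions (structural similarity over closed neighborhoods, a similarity threshold `eps`, and a minimum count `mu`), and every function name below is hypothetical. Core-to-core links inside a partition are joined first; links that cross partitions are deferred and applied in a merge step.

```python
from math import sqrt

def build_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def similarity(adj, u, v):
    """SCAN structural similarity over closed neighborhoods."""
    a, b = adj[u] | {u}, adj[v] | {v}
    return len(a & b) / sqrt(len(a) * len(b))

def eps_neighbors(adj, v, eps):
    """Closed-neighborhood members whose similarity to v is at least eps."""
    return {w for w in adj[v] | {v} if similarity(adj, v, w) >= eps}

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def partition_merge_cores(adj, parts, eps, mu):
    """Group core nodes per partition, then merge across partitions.

    Assumption: `parts` is a list of node lists that together cover every
    node in `adj`. Only core nodes are grouped in this sketch.
    """
    part_of = {v: i for i, part in enumerate(parts) for v in part}
    cores = {v for v in adj if len(eps_neighbors(adj, v, eps)) >= mu}
    parent = {v: v for v in cores}
    deferred = []  # core-core links that cross a partition boundary

    # Phase 1: join similar cores that live in the same partition.
    for part in parts:
        for u in part:
            if u not in cores:
                continue
            for w in eps_neighbors(adj, u, eps) - {u}:
                if w not in cores:
                    continue
                if part_of[w] == part_of[u]:
                    union(parent, u, w)
                else:
                    deferred.append((u, w))

    # Phase 2: merge partition-local groups along the deferred links.
    for u, w in deferred:
        union(parent, u, w)

    groups = {}
    for v in cores:
        groups.setdefault(find(parent, v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())
```

In this sketch, splitting the nodes into several partitions gives the same core groups as treating the graph as one partition, because every similar core pair is joined either locally or in the merge step. That mirrors, in a very reduced form, the summary's claim that the merged result matches a global clustering.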
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2020.114221