TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 07.08.2023 |
Subjects | |
Summary: | We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative
clustering (HAC) algorithm that scales to trillion-edge graphs. Our algorithm
is based on a new approach that combines the nearest-neighbor chain algorithm
with the notion of $(1+\epsilon)$-approximate HAC. This approach allows us to
partition the graph among multiple machines and make significant progress in
computing the clustering within each partition before any communication with
other partitions is needed.
We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8
trillion edges. We show that TeraHAC requires over 100x fewer rounds than
previously known approaches for computing HAC. It is up to 8.3x faster than
SCC, the state-of-the-art distributed algorithm for hierarchical clustering,
while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the
quality of the celebrated HAC algorithm while significantly improving the
running time. |
---|---|
DOI: | 10.48550/arxiv.2308.03578 |
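
The notion of $(1+\epsilon)$-approximate HAC mentioned in the summary can be illustrated with a small sequential sketch. It assumes the commonly used definition (a merge is allowed whenever its similarity is within a factor $(1+\epsilon)$ of the current best similarity) and average linkage on a weighted similarity graph; neither assumption is stated in this record. The function name `approx_hac` and the `(u, v, w)` edge format are hypothetical, and the sketch does not reproduce TeraHAC's graph partitioning or its use of the nearest-neighbor chain algorithm.

```python
from collections import defaultdict


def approx_hac(n, edges, eps):
    """Sequential (1+eps)-approximate average-linkage HAC on a similarity graph.

    n     -- number of vertices, labeled 0..n-1
    edges -- iterable of (u, v, w) with similarity w > 0
    eps   -- approximation slack; eps = 0 gives exact greedy HAC

    Returns a list of merges (a, b, similarity); cluster b is merged into a.
    Assumption: a merge is "good" if similarity * (1 + eps) >= current maximum.
    """
    # total[u][v] = sum of edge weights between clusters u and v
    total = defaultdict(dict)
    size = {v: 1 for v in range(n)}
    for u, v, w in edges:
        if u != v:
            total[u][v] = total[u].get(v, 0.0) + w
            total[v][u] = total[v].get(u, 0.0) + w

    def sim(u, v):
        # Average linkage: total cut weight divided by number of vertex pairs.
        return total[u][v] / (size[u] * size[v])

    merges = []
    while True:
        pairs = [(sim(u, v), u, v) for u in total for v in total[u] if u < v]
        if not pairs:
            break
        best = max(p[0] for p in pairs)
        # Any pair within a (1 + eps) factor of the best is acceptable;
        # this sketch simply takes the first one it finds.
        s, u, v = next(p for p in pairs if p[0] * (1 + eps) >= best)
        merges.append((u, v, s))

        # Merge cluster v into cluster u, combining edges to shared neighbors.
        del total[u][v]
        for x, w in total.pop(v).items():
            if x == u:
                continue
            del total[x][v]
            total[u][x] = total[u].get(x, 0.0) + w
            total[x][u] = total[u][x]
        size[u] += size.pop(v)
    return merges


if __name__ == "__main__":
    path = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 0.9)]
    for a, b, s in approx_hac(4, path, eps=0.1):
        print(f"merge {b} into {a} at similarity {s:.3f}")
```

With `eps = 0` the loop always picks a maximum-similarity pair, so it reduces to exact greedy HAC. Larger values of `eps` widen the set of acceptable merges, which is the kind of freedom a distributed algorithm can exploit to merge within a partition without first locating the global maximum.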