Adjusting the adjusted Rand Index: A multinomial story

Bibliographic Details
Published in	Computational Statistics, Vol. 38, no. 1, pp. 327-347
Main Authors	Sundqvist, Martina; Chiquet, Julien; Rigaill, Guillem
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.03.2023
Subjects	Economic Theory/Quantitative Economics/Mathematical Methods; Mathematics and Statistics; Original Paper; Probability and Statistics in Computer Science; Probability Theory and Stochastic Processes; Statistics; Multinomial distribution; Statistical inference; Clustering; Rand Index
Summary:	The Adjusted Rand Index (ARI) is arguably one of the most popular measures for cluster comparison. The adjustment of the ARI is based on a hypergeometric distribution assumption which is not satisfactory from a modeling point of view because (i) it is not appropriate when the two clusterings are dependent, (ii) it forces the size of the clusters, and (iii) it ignores the randomness of the sampling. In this work, we present a new "modified" version of the Rand Index. First, as in Russell et al. (J Malar Inst India 3(1), 1940), we consider only the pairs consistent by similarity and ignore the pairs consistent by difference to define the MRI. Second, we base the adjusted version, called MARI, on a multinomial distribution instead of a hypergeometric distribution. The multinomial model is advantageous because it does not force the size of the clusters, correctly models randomness, and is easily extended to the dependent case. We show that ARI is biased under the multinomial model and that the difference between ARI and MARI can be significant for small n but essentially vanishes for large n, where n is the number of individuals. Finally, we provide an efficient algorithm to compute all these quantities ((A)RI and M(A)RI) based on a sparse representation of the contingency table in our aricode package. The space and time complexity is linear with respect to the number of samples and, more importantly, does not depend on the number of clusters, as we do not explicitly compute the contingency table.
ISSN:	0943-4062, 1613-9658
DOI:	10.1007/s00180-022-01230-7
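
The summary's idea of computing the indices from a sparse representation of the contingency table can be illustrated with a minimal Python sketch. This is not the aricode implementation (aricode is an R package and its internals are not described in this record); the function names are hypothetical, and hashing the label pairs is an assumed stand-in for whatever sparse structure the authors use. Only the non-empty cells of the table are stored, so memory is bounded by the number of samples rather than by the product of the cluster counts. The MRI is taken here to be the fraction of pairs grouped together in both clusterings, which is one reading of "pairs consistent by similarity". The MARI is omitted because its multinomial adjustment is not given in the summary.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count pairs using only the non-empty contingency cells (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same individuals")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two individuals are needed to form a pair")
    # Sparse contingency table: one entry per observed (cluster_a, cluster_b) pair.
    joint = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return same_both, same_a, same_b, comb(n, 2)


def rand_index(labels_a, labels_b):
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    # Pairs separated in both clusterings, by inclusion-exclusion.
    diff_both = total - same_a - same_b + same_both
    return (same_both + diff_both) / total


def modified_rand_index(labels_a, labels_b):
    # Assumption: MRI keeps only the pairs grouped together in both clusterings.
    same_both, _, _, total = pair_counts(labels_a, labels_b)
    return same_both / total


def adjusted_rand_index(labels_a, labels_b):
    # Classical hypergeometric adjustment (the ARI, not the paper's MARI).
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    expected = same_a * same_b / total
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both clusterings are a single cluster);
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (same_both - expected) / (max_index - expected)
```

With expected constant-time dictionary operations, each function makes a fixed number of passes over the labels, in keeping with the linear complexity mentioned in the summary.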