On the performance of high dimensional data clustering and classification algorithms

Bibliographic Details
Published in	Future generation computer systems Vol. 29; no. 4; pp. 1024 - 1034
Main Authors	Ericson, Kathleen; Pallickara, Shrideep
Format	Journal Article
Language	English
Published	Elsevier B.V 01.06.2013
Subjects	Classification; Clustering; Distributed stream processing; Granules; Hadoop; Machine learning; Mahout

Summary:	There is often a need to perform machine learning tasks on voluminous amounts of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of 4 clustering algorithms and 2 classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.
► We analyze distributed machine learning algorithms in stream and file based processing systems.
► We examine how each approach affects failure recovery.
► Case study of 4 clustering and 2 classification implementations.
ISSN:	0167-739X, 1872-7115
DOI:	10.1016/j.future.2012.05.026