On the performance of high dimensional data clustering and classification algorithms


Bibliographic Details
Published in: Future generation computer systems Vol. 29; no. 4; pp. 1024-1034
Main Authors: Ericson, Kathleen; Pallickara, Shrideep
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.06.2013
Summary: There is often a need to perform machine learning tasks on voluminous amounts of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of 4 clustering algorithms and 2 classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.

Highlights:
► We analyze distributed machine learning algorithms in stream and file based processing systems.
► We examine how each approach affects failure recovery.
► Case study of 4 clustering and 2 classification implementations.
ISSN: 0167-739X, 1872-7115
DOI: 10.1016/j.future.2012.05.026