Instance selection for big data based on locally sensitive hashing and double-voting mechanism

Bibliographic Details
Published in	Advances in computational intelligence Vol. 2; no. 2; p. 20
Main Authors	Zhai, Junhai; Huang, Yajie
Format	Journal Article
Language	English
Published	Cham: Springer International Publishing, 01.04.2022; Springer Nature B.V.
Subjects	Accuracy; Approximation; Artificial Intelligence; Big Data; Compression ratio; Computational Intelligence; Data mining; Datasets; Deep learning; Efficiency; Engineering; Feature selection; Genetic algorithms; Hash based algorithms; Iterative methods; Machine Learning; Medical research; Methods; Original Article; Random variables; Open source platforms; Instance selection; Locally sensitive hashing; Voting mechanism
Summary:	Growing data volumes pose unprecedented challenges to traditional data mining in preprocessing, learning, and analysis, and the design of efficient compression, indexing, and search methods has recently attracted much attention. Inspired by locally sensitive hashing (LSH), the divide-and-conquer strategy, and a double-voting mechanism, we propose an iterative instance selection algorithm that runs for p rounds to reduce or eliminate unwanted bias in the optimal solution through double voting. In each iteration, the algorithm partitions the big dataset into several subsets and distributes them to different computing nodes. On each node, the instances in the local data subset are transformed into Hamming space by l hash functions in parallel; each instance is assigned to one of the l hash tables by its corresponding hash code, and instances with the same hash code are put into the same bucket. A proportion of the instances is then randomly selected from each bucket in each hash table, yielding one subset per table. The resulting l subsets are used for voting to select the locally optimal instance subset. This process is repeated p times to obtain p subsets, and the globally optimal instance subset is finally obtained by voting over these p subsets. The algorithm is implemented on two open source big data platforms, Hadoop and Spark, and compared experimentally with three state-of-the-art methods on testing accuracy, compression ratio, and running time. The experimental results show that the proposed algorithm outperforms the three baseline methods.
ISSN:	2730-7794 2730-7808
DOI:	10.1007/s43674-022-00033-z
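
The selection procedure described in the summary can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the record does not specify the hash family, the sampling proportion, or the voting rule, so the sketch assumes random-hyperplane hashing to produce binary codes, a fixed per-bucket sampling proportion, and simple majority voting. The partitioning of the dataset across computing nodes and the Hadoop and Spark implementations are omitted. All function and parameter names (`lsh_codes`, `sample_from_buckets`, `vote`, `select_instances`, `n_tables`, `n_rounds`, `n_bits`, `proportion`) are hypothetical.

```python
import numpy as np


def lsh_codes(X, n_bits, rng):
    """Map each row of X to a binary code (assumed hash family: random hyperplanes)."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)


def sample_from_buckets(codes, proportion, rng):
    """Randomly keep a proportion of the instances in every bucket of one hash table."""
    buckets = {}
    for idx, code in enumerate(codes):
        # Instances with the same hash code land in the same bucket.
        buckets.setdefault(code.tobytes(), []).append(idx)
    chosen = []
    for members in buckets.values():
        k = max(1, int(round(proportion * len(members))))
        chosen.extend(rng.choice(members, size=k, replace=False))
    return np.array(chosen, dtype=int)


def vote(subsets, n_instances, min_votes):
    """Keep the instances that appear in at least min_votes of the subsets (assumed rule)."""
    counts = np.zeros(n_instances, dtype=int)
    for subset in subsets:
        counts[subset] += 1
    return np.flatnonzero(counts >= min_votes)


def select_instances(X, n_tables=5, n_rounds=3, n_bits=4, proportion=0.5, seed=0):
    """Run n_rounds (p) rounds of n_tables (l) hash tables with two levels of voting."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    round_winners = []
    for _ in range(n_rounds):
        # One candidate subset per hash table.
        candidates = [
            sample_from_buckets(lsh_codes(X, n_bits, rng), proportion, rng)
            for _ in range(n_tables)
        ]
        # First vote: the locally optimal subset for this round.
        round_winners.append(vote(candidates, n, min_votes=(n_tables + 1) // 2))
    # Second vote: the final subset across all rounds.
    return vote(round_winners, n, min_votes=(n_rounds + 1) // 2)
```

Calling `select_instances(X)` on a NumPy array of shape `(n_instances, n_features)` returns the sorted indices of the retained instances. The majority thresholds are a design choice made only for this sketch; a different voting rule would change how aggressively the data is compressed.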