Semi-Supervised Hashing for Large-Scale Search

Bibliographic Details
Published in	IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, no. 12, pp. 2393-2406
Main Authors	Jun Wang, S. Kumar, Shih-Fu Chang
Format	Journal Article
Language	English
Published	Los Alamitos, CA: IEEE Computer Society, 01.12.2012
Subjects	Applied sciences, Artificial neural networks, Binary code, Binary codes, Computer science; control theory; systems, Database, Efficiency, Encoding, Exact sciences and technology, Extraterrestrial measurements, Hashing, Information systems. Data bases, Information theory, Large scale, Locality, Memory organisation. Data processing, Metric, Nearest neighbor search, Nearest neighbour, Pairwise labels, Random function, Semantic analysis, Semantics, Semi-supervised hashing, Semisupervised learning, Sequential analysis, Sequential hashing, Sequential method, Set theory, Similarity, Software, Spectral function, Very large databases
Summary:	Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular due to its computational and memory efficiency. Popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections. The resulting hashes are either not very accurate or inefficient. Moreover, these methods are designed for a given metric similarity. In contrast, semantic similarity is usually given in terms of pairwise labels of samples. Supervised hashing methods exist that can handle such semantic similarity, but they are prone to overfitting when the labeled data are scarce or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and applies an information-theoretic regularizer over both labeled and unlabeled sets. Based on this framework, we present three semi-supervised hashing methods: orthogonal hashing, nonorthogonal hashing, and sequential hashing. In particular, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised domains where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
ISSN:	0162-8828, 1939-3539, 2160-9292
DOI:	10.1109/TPAMI.2012.48
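
The Summary describes a framework that combines an empirical fit on labeled pairs with a regularizer over all data. As a rough illustration only, the sketch below assumes a relaxed orthogonal variant in which the projection directions are the top eigenvectors of a matrix that adds a pairwise-label term to a variance term. The function names, the weight `eta`, the row-per-sample data layout, and the +1/-1/0 encoding of the label matrix `S` are assumptions made for this example and do not come from this record. It is not the authors' implementation.

```python
import numpy as np


def fit_projections(X, labeled_idx, S, n_bits, eta=1.0):
    """Learn orthogonal projection directions (hypothetical helper).

    X           : (n, d) array, one sample per row.
    labeled_idx : indices of the samples that carry pairwise labels.
    S           : (l, l) symmetric matrix over the labeled samples,
                  +1 for similar pairs, -1 for dissimilar, 0 for unknown.
    n_bits      : number of hash bits (must not exceed d).
    eta         : assumed weight of the variance-based regularizer.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                      # center so sign() splits the data
    Xl = Xc[labeled_idx]
    # Supervised term from labeled pairs plus variance term from all data.
    M = Xl.T @ S @ Xl + eta * (Xc.T @ Xc)
    M = (M + M.T) / 2.0                # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(eigvals)[::-1][:n_bits]
    return eigvecs[:, top], mean


def encode(X, W, mean):
    """Map samples to binary codes, one bit per projection direction."""
    return ((X - mean) @ W > 0).astype(np.uint8)


def hamming_distances(query_code, codes):
    """Hamming distance from one code to every row of `codes`."""
    return np.count_nonzero(codes != query_code, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))
    labeled_idx = np.arange(10)
    classes = rng.integers(0, 2, size=10)          # toy class labels
    S = np.where(classes[:, None] == classes[None, :], 1.0, -1.0)
    W, mean = fit_projections(X, labeled_idx, S, n_bits=8)
    codes = encode(X, W, mean)
    nearest = np.argsort(hamming_distances(codes[0], codes))[:5]
    print(nearest)
```

Search then reduces to comparing short binary codes by Hamming distance, which is where the memory and speed benefits mentioned in the Summary come from. The nonorthogonal and sequential variants named there are not covered by this sketch.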