Learning Deep Hierarchical Visual Feature Coding

In this paper, we propose a hybrid architecture that combines the image modeling strengths of the bag of words framework with the representational power and adaptability of learning deep architectures. Local gradient-based descriptors, such as SIFT, are encoded via a hierarchical coding scheme compo...

Full description

Saved in:

Bibliographic Details
Published in	IEEE transaction on neural networks and learning systems Vol. 25; no. 12; pp. 2212 - 2225
Main Authors	Hanlin Goh, Thome, Nicolas, Cord, Matthieu, Joo-Hwee Lim
Format	Journal Article
Language	English
Published	United States IEEE 01.12.2014 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Adaptation models Architecture Bag-of-words (BoW) framework Coding Computer architecture Computer Science computer vision deep learning Dictionaries dictionary learning Encoding hierarchical visual architecture image categorization Learning Learning systems Mathematical models Neural networks Representations restricted Boltzmann machine (RBM) sparse feature coding transfer learning Visual Visualization deep learning transfer learning restricted Boltzmann machine (RBM) sparse feature coding Bag-of-words (BoW) framework hierarchical visual architecture computer vision dictionary learning image categorization
Online Access	Get full text

Cover

Loading…

More Information
Summary:	In this paper, we propose a hybrid architecture that combines the image modeling strengths of the bag of words framework with the representational power and adaptability of learning deep architectures. Local gradient-based descriptors, such as SIFT, are encoded via a hierarchical coding scheme composed of spatial aggregating restricted Boltzmann machines (RBM). For each coding layer, we regularize the RBM by encouraging representations to fit both sparse and selective distributions. Supervised fine-tuning is used to enhance the quality of the visual representation for the categorization task. We performed a thorough experimental evaluation using three image categorization data sets. The hierarchical coding scheme achieved competitive categorization accuracies of 79.7% and 86.4% on the Caltech-101 and 15-Scenes data sets, respectively. The visual representations learned are compact and the model's inference is fast, as compared with sparse coding methods. The low-level representations of descriptors that were learned using this method result in generic features that we empirically found to be transferrable between different image data sets. Further analysis reveal the significance of supervised fine-tuning when the architecture has two layers of representations as opposed to a single layer.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 content type line 23
ISSN:	2162-237X 2162-2388 2162-2388
DOI:	10.1109/TNNLS.2014.2307532