Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization
•To the best of our knowledge, Mask-CNN is the first end-to-end model that selects deep convolutional descriptors for object recognition, especially for fine-grained image recognition.•We present a novel and efficient part-based three-stream model for fine-grained recognition. By discarding the full...
Saved in:
Published in | Pattern recognition Vol. 76; pp. 704 - 714 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published |
Elsevier Ltd
01.04.2018
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | •To the best of our knowledge, Mask-CNN is the first end-to-end model that selects deep convolutional descriptors for object recognition, especially for fine-grained image recognition.•We present a novel and efficient part-based three-stream model for fine-grained recognition. By discarding the fully connected layers, the proposed M-CNN is computationally efficient (cf. Table 1 and Table 4 in experiments). Additionally, comparing with state-of-the-art methods, M-CNN has smaller feature dimensionality. Beyond those, it achieves the highest classification accuracy on CUB200-2011 and Birdsnap among published methods.•The part localization performance of the proposed model outperforms other part-based finegrained approaches which requires additional bounding boxes. In particular, M-CNN is 12.76% higher than state-of-the-art for head localization on CUB200-2011.
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations. In this paper, we prove that selecting useful deep descriptors contributes well to fine-grained image recognition. Specifically, a novel Mask-CNN model without the fully connected layers is proposed. Based on the part annotations, the proposed model consists of a fully convolutional network to both locate the discriminative parts (e.g., head and torso), and more importantly generate weighted object/part masks for selecting useful and meaningful convolutional descriptors. After that, a three-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. Thanks to discarding the parameter redundant fully connected layers, our Mask-CNN has a small feature dimensionality and efficient inference speed by comparing with other fine-grained approaches. Furthermore, we obtain a new state-of-the-art accuracy on two challenging fine-grained bird species categorization datasets, which validates the effectiveness of both the descriptor selection scheme and the proposed Mask-CNN model. |
---|---|
ISSN: | 0031-3203 1873-5142 |
DOI: | 10.1016/j.patcog.2017.10.002 |