Unsupervised Word Clustering Using Deep Features

Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this pa...

Full description

Saved in:
Bibliographic Details
Published in2016 12th IAPR Workshop on Document Analysis Systems (DAS) pp. 263 - 268
Main Authors Kulkarni, Mandar, Karande, Shirish Subhash, Lodha, Sachin
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.04.2016
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations.
DOI:10.1109/DAS.2016.14