Unsupervised Word Clustering Using Deep Features
Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this pa...
Saved in:
Published in | 2016 12th IAPR Workshop on Document Analysis Systems (DAS) pp. 263 - 268 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.04.2016
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations. |
---|---|
DOI: | 10.1109/DAS.2016.14 |