A simple and effective approach for border noise removal from document images

Bibliographic Details
Published in 2009 IEEE 13th International Multitopic Conference pp. 1 - 5
Main Authors Shafait, F., Breuel, T.M.
Format Conference Proceeding
Language English
Published IEEE 01.12.2009
Summary: When digitizing bound material like books or magazines, marginal noise appears along the page border. This noise consists of undesired text parts from the neighboring page and/or speckles that result from the binarization process. When a keyword-based search is performed in a digitized collection, textual noise in particular poses problems, since the returned search results might correspond to textual noise instead of the actual contents of the page. Manually removing marginal noise from each page is not feasible in large-scale digitization projects. In this paper, we present a simple and effective approach for removing both textual and non-textual noise by finding the borders of noise regions using projection profile analysis. We demonstrate the effectiveness of our approach by evaluating it quantitatively on the widely used University of Washington (UW3) dataset. The results show that our approach reduces the noise ratio from 70% to 20% while retaining more than 99% of the actual page contents. Comparison with state-of-the-art approaches shows that our algorithm performs comparably to them, while being simple to understand and easy to implement. We also provide an open source implementation of our method as part of the OCRopus OCR system.
ISBN: 1424448727
9781424448722
DOI: 10.1109/INMIC.2009.5383115
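
The summary names projection profile analysis as the way noise region borders are found, but this record does not give the details. The following is a minimal sketch of the general idea only, not the authors' algorithm or the OCRopus implementation. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background. The function names (`column_profile`, `find_runs`, `crop_to_widest_run`), the `min_ink` threshold, and the "keep the widest inked run" rule are all hypothetical choices made for illustration.

```python
def column_profile(image):
    """Vertical projection profile: number of ink pixels in each column."""
    return [sum(col) for col in zip(*image)]


def find_runs(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, of consecutive columns
    whose ink count is at least min_ink."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a new inked run begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # the run ended at a gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def crop_to_widest_run(image, min_ink=1):
    """Blank all columns outside the widest inked run.

    Assumption for illustration: the page content is the widest block of
    inked columns, and narrower blocks separated from it by a gap are
    marginal noise.
    """
    runs = find_runs(column_profile(image), min_ink)
    if not runs:
        return [row[:] for row in image]
    left, right = max(runs, key=lambda r: r[1] - r[0])
    return [
        [px if left <= x < right else 0 for x, px in enumerate(row)]
        for row in image
    ]


# Toy page: a stripe of neighboring-page ink on the left (column 0),
# the page content in columns 3-5, and a speckle in column 7.
page = [
    [1, 0, 0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 1],
]
cleaned = crop_to_widest_run(page)
```

The same profile can be taken over rows (`[sum(row) for row in image]`) to look for noise along the top and bottom borders. A real scan would likely need smoothing of the profile and a threshold above 1, since speckles can bridge the gap between noise and content.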