Kurdish Text Segmentation using Projection-Based Approaches


Bibliographic Details
Published in: UHD Journal of Science and Technology, Vol. 5, no. 1, pp. 56-65
Main Authors: Tofiq, Tofiq Ahmed; Hussein, Jamal Ali
Format: Journal Article
Language: English
Published: University of Human Development, 16.05.2021
ISSN: 2521-4209, 2521-4217
DOI: 10.21928/uhdjst.v5n1y2021.pp56-65

Summary: An optical character recognition (OCR) system can solve the data entry problem of saving printed documents as soft copies. OCR systems are therefore being developed for all languages, and Kurdish is no exception. Kurdish is one of the languages that present special challenges to OCR. The main challenge is that Kurdish script is mostly cursive, so the segmentation process must be able to identify where each character begins and ends. This step is important for character recognition. This paper presents an algorithm for Kurdish character segmentation. The proposed algorithm uses projection-based concepts to separate lines, words, and characters. It computes the vertical projection of a word and then identifies the areas where the word's characters can be split. A post-processing stage then handles the over-segmentation problems that occur in the initial segmentation stage. The proposed method is tested on a data set of text images that vary in font size, type, and style and contain more than 63,000 characters. The experiments show that the proposed algorithm can segment Kurdish words with an average accuracy of 98.6%.
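
The general idea of projection-based character segmentation described in the summary can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it is a generic illustration that assumes a binarized word image (1 = ink, 0 = background), treats low-ink columns as candidate split areas, and uses a simple minimum-width rule as a stand-in for the post-processing stage. The function names and the `threshold` and `min_width` parameters are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumed binarized input)


def vertical_projection(img: Image) -> List[int]:
    """Count the ink pixels in each column of the word image."""
    if not img:
        return []
    return [sum(row[x] for row in img) for x in range(len(img[0]))]


def candidate_cuts(profile: List[int], threshold: int) -> List[int]:
    """Return the centre column of every run whose projection is <= threshold.

    In cursive script the projection rarely drops to zero between characters,
    so a small threshold (roughly the connecting stroke thickness) is assumed.
    """
    cuts: List[int] = []
    start = None
    # A sentinel above the threshold closes a run that reaches the last column.
    for x, value in enumerate(profile + [threshold + 1]):
        if value <= threshold:
            if start is None:
                start = x
        elif start is not None:
            cuts.append((start + x - 1) // 2)
            start = None
    return cuts


def merge_narrow_segments(cuts: List[int], width: int, min_width: int) -> List[int]:
    """Hypothetical post-processing: drop cuts that create too-narrow segments."""
    kept: List[int] = []
    prev = 0
    for c in cuts:
        if c - prev >= min_width and width - c >= min_width:
            kept.append(c)
            prev = c
    return kept


def segment_word(img: Image, threshold: int = 1, min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, for each segment."""
    profile = vertical_projection(img)
    cuts = merge_narrow_segments(candidate_cuts(profile, threshold), len(profile), min_width)
    bounds = [0] + cuts + [len(profile)]
    return list(zip(bounds[:-1], bounds[1:]))


# Two blobs joined by a one-pixel-thick connecting stroke on the bottom row.
word = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]
print(segment_word(word))  # [(0, 4), (4, 10)]
```

In this sketch the cut is placed at the centre of the low-ink run, and any cut that would leave a segment narrower than `min_width` is discarded. A real system would need language-specific rules for the over-segmentation cases the summary mentions.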