Optical Character Recognition for Nastaleeq Printed Urdu Text using Histogram of Oriented Gradient Features
DOI:
https://doi.org/10.66108/mna.v3i1.67
Keywords:
Urdu language, Optical Character Recognition, HOG features, Connected Components, Support Vector Machine
Abstract
The focus of research on optical character recognition (OCR) has been to digitize text in images. Urdu OCR is challenging because of the complexity of the script: a character can take multiple shapes depending on its position in the word, which makes recognition harder than for English and similar languages. The proposed research aims to recognize offline printed Urdu text using a segmentation-free (holistic) approach. Horizontal histogram projection is used to extract text lines from an image, and connected-component labelling is used to segment ligatures within each extracted text line. To train the proposed model, a set of 14 statistical features along with HOG features is extracted for each sub-word/ligature. The open-source UPTI dataset is used to train and test the proposed algorithm, and an SVM with an RBF kernel is used to classify the ligatures. The proposed algorithm achieves a 97.3% character recognition rate on the given dataset.
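The two segmentation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_lines` and `label_components`, the 0/1 list-of-rows image format, the use of 8-connectivity, and the toy `page` are all assumptions made for illustration.

```python
from collections import deque


def split_lines(img):
    """Find text lines with a horizontal projection profile.

    img is a list of rows, each a list of 0/1 ints (1 = ink).
    Returns (top, bottom) row ranges, bottom exclusive.
    """
    profile = [sum(row) for row in img]  # ink pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                  # first inked row of a line
        elif count == 0 and start is not None:
            lines.append((start, y))   # blank row closes the line
            start = None
    if start is not None:              # line touching the bottom edge
        lines.append((start, len(img)))
    return lines


def label_components(img):
    """Find 8-connected ink regions with a breadth-first flood fill.

    Returns bounding boxes (top, left, bottom, right), with bottom and
    right exclusive, in row-major order of discovery.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom + 1, right + 1))
    return boxes


# Toy binary page (assumed data): two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = split_lines(page)
components = [label_components(page[top:bottom]) for top, bottom in lines]
```

Each bounding box marks one candidate ligature region from which features would then be extracted. The sketch leaves out steps the abstract does not describe, such as binarization, ordering components for right-to-left reading, and attaching dots or diacritics to their base ligature.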
License
© This work is published by Machines and Algorithms and licensed under the terms of Creative Commons Attribution 4.0 International License (CC BY 4.0).
