Optical Character Recognition for Nastaleeq Printed Urdu Text using Histogram of Oriented Gradient Features

Authors

  • Awais Ahmad, Department of Computer Science, Bahauddin Zakariya University, Multan, 60000, Pakistan
  • Fatima Yousaf, Department of Computer Science, Bahauddin Zakariya University, Multan, 60000, Pakistan
  • Tanzeela Kousar, Institute of Computer Science and Information Technology, The Women University Multan, 60000, Pakistan

DOI:

https://doi.org/10.66108/mna.v3i1.67

Keywords:

Urdu language, Optical Character Recognition, HOG features, Connected Components, Support Vector Machine

Abstract

Research on optical character recognition (OCR) focuses on digitizing text in images. Urdu OCR is challenging because a character can take multiple shapes depending on its position in the word, which makes recognition harder than for English and similar languages. The proposed research aims to recognize offline printed Urdu text using a segmentation-free (holistic) approach. Horizontal histogram projection is used to extract text lines from an image, and connected components labelling is used to segment ligatures within each extracted text line. To train the proposed model, a set of 14 statistical features, together with HOG features, is extracted for each sub-word/ligature. The open-source UPTI dataset is used to train and test the proposed algorithm, and an SVM with an RBF kernel is used to classify the ligatures. The proposed algorithm achieves a character recognition rate of 97.3% on this dataset.
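The two segmentation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_text_lines` and `label_components`, the use of 8-connectivity, and the tiny binary `page` are assumptions made for illustration. The sketch does not attach diacritics to their main ligature bodies or impose right-to-left ordering, and feature extraction and SVM classification are left out.

```python
from collections import deque


def split_text_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    `image` is a binary grid (list of rows) where 1 marks an ink pixel.
    """
    # Horizontal projection: number of ink pixels in each row.
    profile = [sum(row) for row in image]
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


def label_components(image):
    """Return bounding boxes (top, left, bottom, right), inclusive,
    of the 8-connected ink components, in row-major discovery order."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 6x6 binary page with two text lines.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

for start, end in split_text_lines(page):
    print((start, end), label_components(page[start:end]))
```

Each bounding box is expressed relative to its own line crop. In a fuller pipeline, each cropped component would be passed on to feature extraction and then to the classifier.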

Published

2024-03-14

How to Cite

Ahmad, A., Yousaf, F., & Kousar, T. (2024). Optical Character Recognition for Nastaleeq Printed Urdu Text using Histogram of Oriented Gradient Features. Machines and Algorithms, 3(1), 28–42. https://doi.org/10.66108/mna.v3i1.67

Section

Articles
