Dr. Nimrita Koul | The Benchmarking Conference | LDC-IL

Building and evaluating an OCR System for the Sharda Script

Dr. Nimrita Koul

Faculty (Associate Professor)
REVA University, Bangalore (School of CSE)


Authors: Dr. Nimrita Koul, Sudhanva M Athreya, PVS Kumar

Abstract

The Sharda script, an ancient Brahmic script historically used for writing Sanskrit and Kashmiri, has largely fallen into disuse, with limited digital resources available for its preservation and study. This paper presents our work on developing a deep learning-based Optical Character Recognition (OCR) system for Sharda, designed to extract text from publicly available scanned historical manuscripts. The task of OCR in this domain presents significant challenges, including noisy, torn, and incomplete scans, as well as variability in handwriting styles across different manuscripts, leading to reduced recognition accuracy.

As part of this work, we curated a digital dataset of 900 pages from scriptures such as The Bhagavad Gita, Ganesha Stotram, Devi Mahatmya, Devi Stuthi, and Ram Gita, and built an end-to-end text recognition system. Our system consists of two key stages: (1) Text Segmentation, where a model predicts text-line masks for subsequent recognition, and (2) Text Recognition, where each extracted text line is processed by our deep Convolutional Recurrent Neural Network (CRNN) for handwritten text recognition.
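To illustrate how the two stages connect, the following is a minimal sketch of turning a predicted text-line mask into line crops for the recognizer. It assumes the mask is a binary grid (1 for text pixels) and that lines are separated by at least one empty row; the names `line_spans_from_mask` and `crop_lines` are hypothetical and are not taken from our system.

```python
from typing import List, Sequence, Tuple


def line_spans_from_mask(
    mask: Sequence[Sequence[int]], min_height: int = 1
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    Assumption: `mask` is a binary grid where nonzero means "text pixel",
    and consecutive text lines are separated by at least one empty row.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # row index where the current line began, if any
    for y, row in enumerate(mask):
        has_text = any(row)
        if has_text and start is None:
            start = y  # entering a text line
        elif not has_text and start is not None:
            if y - start >= min_height:
                spans.append((start, y))  # leaving a text line
            start = None
    # A line that touches the bottom edge of the page is closed here.
    if start is not None and len(mask) - start >= min_height:
        spans.append((start, len(mask)))
    return spans


def crop_lines(image: Sequence, mask: Sequence[Sequence[int]]) -> List:
    """Slice the page image into per-line strips using the mask's row spans."""
    return [image[top:bottom] for top, bottom in line_spans_from_mask(mask)]


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 1],
    ]
    print(line_spans_from_mask(toy_mask))
```

Real manuscript masks are rarely this clean: skewed or touching lines would need connected-component or contour analysis rather than a simple row scan.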

Because annotated data was limited, we employed a two-phase training strategy: pretraining and fine-tuning. During pretraining, an autoencoder-based approach was applied to both images and text labels to learn features without supervision. During fine-tuning, we trained the Convolutional Recurrent Neural Network (CRNN). In this architecture, the CNN backbone retains spatial information while the RNN head captures semantic patterns, enabling generalization across manuscripts with limited annotations. The model achieved 87.7% accuracy when trained on 11,000 image-text pairs, demonstrating its effectiveness in recognizing handwritten Sharda script. Furthermore, this system can be extended to other Brahmi-based scripts, making it a valuable tool for Indian language preservation.
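A CRNN emits one score vector per horizontal frame of the line image, so the frame sequence must be collapsed into a character string. The sketch below shows greedy decoding under the assumption of a CTC-style output layer with a reserved blank symbol. This abstract does not state which decoding scheme our system uses, so the function name `ctc_greedy_decode`, the blank index, and the toy alphabet are all illustrative assumptions.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    blank: int = 0,
) -> str:
    """Collapse per-frame scores into text with greedy CTC-style decoding.

    Assumptions (not from the source): index `blank` is the CTC blank,
    and `alphabet[i]` is the character for class index i.
    """
    decoded: List[str] = []
    prev = blank
    for scores in frame_scores:
        # Pick the highest-scoring class for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the class is not blank and differs from the
        # previous frame, so repeated frames collapse to one character.
        if best != blank and best != prev:
            decoded.append(alphabet[best])
        prev = best
    return "".join(decoded)


if __name__ == "__main__":
    alphabet = ["", "a", "b"]  # index 0 is the blank placeholder
    frames = [
        [0.1, 0.8, 0.1],  # a
        [0.1, 0.8, 0.1],  # a (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.8, 0.1],  # a (new, because a blank intervened)
        [0.1, 0.1, 0.8],  # b
    ]
    print(ctc_greedy_decode(frames, alphabet))
```

The blank between the second and third "a" frames is what lets a doubled character survive the collapse; without it, the two would merge into one.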

A critical challenge in OCR for underrepresented scripts is the lack of standardized evaluation benchmarks. We propose a structured evaluation methodology for assessing Sharda OCR performance, identifying key challenges such as diacritic positioning, ligature formations, and document degradation. We outline essential evaluation metrics, including Character Error Rate (CER) and Word Error Rate (WER), and suggest dataset curation strategies to establish a benchmarking standard for future OCR models in Indian languages.
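Both metrics are conventionally defined as the edit distance between the reference and the OCR output (substitutions + deletions + insertions), divided by the reference length in characters for CER or in words for WER. The following is a minimal sketch of that standard definition, not our evaluation code; the function names are illustrative, and it assumes characters are counted per Unicode code point and words are split on whitespace.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, counted per Unicode code point (assumption)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, with words split on whitespace (assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("abcd", "abed"))
    print(wer("a b c d", "a x c"))
```

The counting unit matters for Brahmic scripts: a consonant with a dependent vowel sign is several code points, so a code-point CER penalizes a misplaced diacritic differently than a CER computed over grapheme clusters. A benchmark should state which unit it uses. Note also that both rates can exceed 1.0 when the output contains many insertions.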

Keywords: Sharda Script, Optical Character Recognition, OCR Evaluation, Indian Languages, Deep Learning, Character Error Rate, Handwritten Text Recognition

Acknowledgement: This work was carried out as part of a research grant received from DST, Government of India, under the project title “An AI Based System for Preservation and Revival of Sharda Script” (Grant number DST/TDT/SHRI-14/2021).