Dr. Kanimozhi Suguna | The Benchmarking Conference | LDC-IL

TAM-ORC: TAMIL ADAPTIVE MORPHOLOGICAL OCR WITH DEEP LEARNING-BASED BENCHMARKING FOR SCRIPT-SPECIFIC CHALLENGES

Dr. Kanimozhi Suguna

Assistant Professor
Department of Computer Applications
Arulmigu Arthanareeswarar Arts and Science College, Tiruchengode, Namakkal – Dt


Abstract

Tamil Optical Character Recognition (OCR) once relied on template-based and statistical techniques, and it has advanced significantly with the introduction of machine learning and deep learning. Nevertheless, Tamil OCR systems continue to struggle with complex ligatures, curved characters, and language-specific variations, particularly in handwritten and aged documents. Existing OCR systems generally suffer from segmentation errors, low recognition accuracy on noisy input, and failure to generalize across different writing styles and dialects of the language.

This research presents a new Hybrid Morphological Attention Transformer (HMAT) model that combines morphological segmentation, character embeddings, and self-attention in a single architecture to improve Tamil OCR. To improve OCR performance on palm-leaf manuscripts, printed texts, and handwritten Tamil, the approach will adopt unsupervised domain adaptation techniques. Furthermore, this research will provide a framework for evaluating and benchmarking Tamil OCR across different script variants, taking into account the diverse character complexities, datasets, and practical environments in which these systems would have to be deployed.

Addressing these issues will improve the adaptability, accuracy, and robustness of Tamil OCR systems, thereby supporting the digitization and wider preservation of Tamil texts. Additionally, this work will contribute to more refined and standardized metrics for evaluating Tamil OCR, helping model developers compare and rank systems by practical usability and linguistic fidelity.
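
As one illustration of the kind of metric such a benchmark could standardize, the sketch below computes an edit-distance error rate in two ways: over raw Unicode code points, and over approximate Tamil written units (a base character together with its trailing vowel signs or pulli). This is a minimal sketch under stated assumptions, not the evaluation protocol of this work. The function names `tamil_units`, `edit_distance`, and `error_rate` are hypothetical, and grouping by Unicode mark categories is a simplification of full grapheme-cluster segmentation.

```python
import unicodedata


def tamil_units(text):
    """Approximate written units: a base character plus trailing marks.

    Assumption: vowel signs and the pulli are identified by their Unicode
    general category (Mn or Mc). This is a simplification, not a full
    grapheme-cluster segmenter.
    """
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch  # attach the mark to the preceding base
        else:
            units.append(ch)
    return units


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp, unit_fn=list):
    """Edit distance divided by reference length, in units from unit_fn."""
    ref_units, hyp_units = unit_fn(ref), unit_fn(hyp)
    if not ref_units:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


if __name__ == "__main__":
    reference = "தமிழ்"   # five code points, three written units
    hypothesis = "தமிழ"   # OCR output that dropped the final pulli
    print(error_rate(reference, hypothesis))               # per code point
    print(error_rate(reference, hypothesis, tamil_units))  # per written unit
```

The same recognition error yields a different score depending on the unit of comparison, which is one reason a shared, script-aware definition of the counting unit matters when results from different Tamil OCR systems are compared.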

Keywords: Tamil OCR, Deep Learning, Morphological Adaptation, Self-Attention Mechanism, Optical Character Recognition, Benchmarking, Script-Specific Challenges, Unsupervised Domain Adaptation, Character Embeddings, Evaluation Metrics