Character Recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form, or in other terms, the “reading” of text. The term has been used to address three very distinct language technologies with different applications.
“Online” handwriting recognition or Online HWR refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and can also be used to teach handwriting skills using computers.
“Offline” handwriting recognition or Offline HWR refers to the interpretation of handwriting captured statically as an image. It can be used for the interpretation of handwriting already recorded on paper, ranging from filled-in forms to handwritten manuscripts.
Optical character recognition or OCR refers to the interpretation of printed text captured as an image. It can be used for conversion of printed or typewritten material such as books and documents into electronic form.
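The difference between dynamically and statically captured input can be made concrete with a small sketch. The types and the `rasterize` helper below are hypothetical, chosen only for illustration; they are not a data format defined by this document. The sketch assumes online input is stored as timed pen strokes, and that offline HWR and OCR both start from a plain image.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration; not a format defined by this document.
Point = Tuple[float, float, float]  # (x, y, t): pen position plus timestamp in seconds


@dataclass
class OnlineSample:
    """Online HWR input: an ordered list of pen strokes captured as they are written."""
    strokes: List[List[Point]]


def rasterize(sample: OnlineSample, width: int, height: int) -> List[List[int]]:
    """Turn an online sample into a static binary image, the kind of input
    Offline HWR and OCR see. Stroke order and timing are discarded."""
    image = [[0] * width for _ in range(height)]
    for stroke in sample.strokes:
        for x, y, _t in stroke:
            col, row = int(round(x)), int(round(y))
            if 0 <= row < height and 0 <= col < width:
                image[row][col] = 1
    return image


sample = OnlineSample(strokes=[
    [(0, 0, 0.00), (1, 1, 0.01), (2, 2, 0.02)],  # first stroke
    [(2, 0, 0.30), (1, 1, 0.31), (0, 2, 0.32)],  # second stroke, crossing the first
])
image = rasterize(sample, width=3, height=3)
for row in image:
    print("".join("#" if px else "." for px in row))
```

The image keeps only where ink ended up, while the online sample also records how many strokes were made and in what order. This is one reason the two technologies call for different algorithms and datasets.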
These different areas of language technology require different algorithms and linguistic resources. However, for convenience, they have been combined under the “character recognition” umbrella. They are all hard research problems because of the variety of writing styles and fonts encountered. Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering. Also, there are no common linguistic resources that can be used by the community. Online and Offline HWR have seen very little research in the context of Indian scripts, and no linguistic resources exist for them.
- Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR.
- Promotion of development of these technologies.
- Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.
This will be achieved in a variety of ways:
- Standards development will primarily be via a mixture of email discussions and face-to-face meetings of working group members organized under the aegis of LDC-IL.
- Tool development will be given as projects to technology institutions with the necessary inclination, skills and resources.
- Linguistic data collection, annotation and validation will be given as projects to linguistics/computational linguistics departments of institutes and universities with the necessary inclination, skills and resources. However, for each linguistic resource developed, validation will be performed by an institution other than the one doing the collection and annotation. Use of the linguistic resources for technology development will be promoted by arranging periodic competitions (for example, for recognition of online handwritten words in specific scripts) and by objective evaluation of performance.
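Objective evaluation in a word recognition competition could, for instance, rest on word accuracy and character error rate. The sketch below is only one possible scoring scheme, not a metric prescribed by this document. The function names are hypothetical, and counting characters as Unicode code points is an assumption that a real standard for Indic scripts might well replace with a different unit.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(refs: List[str], hyps: List[str]) -> float:
    """Fraction of words recognized exactly."""
    correct = sum(1 for r, h in zip(refs, hyps) if r == h)
    return correct / len(refs)


def character_error_rate(refs: List[str], hyps: List[str]) -> float:
    """Total edit distance divided by total reference length (in code points)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Assumed toy data: ground-truth labels and one system's output.
references = ["कमल", "नमक"]
hypotheses = ["कमल", "नमन"]
print(word_accuracy(references, hypotheses))
print(character_error_rate(references, hypotheses))
```

Scoring every submission with the same script against a held-out, independently validated dataset is what makes results from different institutions comparable.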
3. Implementation Phases
Specifically, linguistic data collection will be done in two phases.
Phase I (year 1-3)
Standards are key to the creation of shared linguistic resources. The LDC-IL will adopt established processes for proposing and advancing standards, working with international standards bodies wherever applicable. Standards will be proposed for datasets of online handwriting, offline handwriting and documents, and printed characters.
- Development of tools for data collection
The availability of good tools will allow researchers to start collecting data in different Indian scripts and to contribute data to LDC-IL. Such tools are essential for extending support to all Indian scripts quickly. Tools for data collection and dataset creation will be designed and developed for all three target technologies.
- Promotion of technology development for specific tasks in selected scripts
The LDC-IL will promote the development and implementation of technology for Online HWR, Offline HWR and OCR in the context of specific tasks and selected scripts.
The tasks could be
- to interpret a line of handwriting captured using a handheld computer
- to interpret a form that has been filled in and scanned
- to interpret a page from a book
Although all major Indian languages are objects of research, work will begin with Devanagari, Tamil and Telugu. These scripts offer considerable variety in visual complexity (and hence in the challenge they pose for recognition). Other scripts will be taken up in due course.
- Development of linguistic resources in selected scripts
The working group will drive the creation of significant linguistic resources for the tasks and scripts outlined above.
Some examples of linguistic resources are:
- Online handwritten word samples from at least 500 writers in each script.
- Samples of handwritten characters extracted from forms representing at least 500 writers, and at least 500 samples of each handwritten character in each script.
- Synthetic data covering all printed characters and at least 1000 pages in each script
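Targets such as "at least 500 writers" and "at least 500 samples of each handwritten character" are easy to check mechanically once a dataset has an index. The following is a minimal sketch under assumptions: the record layout (one `(writer_id, character)` pair per sample) and the function name are hypothetical, and the thresholds default to the figures quoted above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def coverage_report(records: Iterable[Tuple[str, str]],
                    alphabet: Iterable[str],
                    min_writers: int = 500,
                    min_samples_per_char: int = 500) -> Dict[str, object]:
    """Summarize whether a character dataset meets its collection targets.

    records: one (writer_id, character) pair per collected sample (assumed layout).
    alphabet: every character the dataset is supposed to cover.
    """
    records = list(records)
    writers = {writer for writer, _ in records}
    per_char = Counter(ch for _, ch in records)
    # Characters with too few samples, including ones that never appear at all.
    under_sampled = sorted(ch for ch in alphabet if per_char[ch] < min_samples_per_char)
    return {
        "writers": len(writers),
        "enough_writers": len(writers) >= min_writers,
        "under_sampled": under_sampled,
    }


# Toy data with small thresholds so the behaviour is visible.
toy = [("w1", "a"), ("w2", "a"), ("w1", "b")]
print(coverage_report(toy, alphabet="ab", min_writers=2, min_samples_per_char=2))
```

A report like this could be produced by the validating institution, which gives it a concrete checklist that does not depend on the collecting institution's own bookkeeping.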
Phase II (year 4-8)
Since standardization requires consensus among creators and users of linguistic resources, it is expected that the process of standardization would continue as an activity beyond the first three years.
The tools created in the first phase will be continuously refined during this second phase, as more and more researchers start to use them and provide feedback and suggestions for improvement.
- Extension of technology tasks and linguistic resources to remaining scripts
The technologies developed for the initial set of scripts will be adapted for other scripts during the second phase. As in the first phase, this technology development will be supported by the creation of linguistic resources for those scripts, subject to budget constraints and interest from researchers working on them.
- Promotion of significant applications
A major activity during the second phase will be the promotion of significant applications with high potential impact on society. These will typically involve solving challenging problems, multiple years of concerted effort, and close interaction between participating institutions and other researchers in India and abroad.
It is envisaged that these applications will be developed for selected languages and scripts such as Hindi, and the same will be extended to other languages and scripts with participation from researchers from all over India in due course of time.
The list of applications below is meant to be indicative rather than exhaustive.
Handwriting Interface to Computers
Indian scripts are complex and not suitable for keyboard-based entry. Replacing the keyboard with a simpler and more natural interface based on handwriting would make computers much more accessible to the common man and to educators in particular.
Imagine that the keyboard is replaced with a special writing pad for handwriting input. As one writes, the writing is converted using HWR technology into words and entered into the target application. The solution would also need to support numerals, punctuation, and editing gestures, and functionally replace the keyboard.
Language technologies used: Online HWR for Indian Languages
The solution described above can also be adapted to provide computer-based instruction in handwriting to improve writing skills of school children, improve literacy as part of adult education programs, or allow literate adults to learn new scripts.
Language technologies used: Online HWR for Indian Languages
Multilingual Digital Libraries for Education
A wealth of literature and other educational material in Indian languages is trapped in books, which require storage and are subject to physical decay. Online books, on the other hand, have no such problems, and may be made available over the Internet to students everywhere, in their schools, homes or hostels.
The proposed solution will use a complete OCR pipeline for converting scanned images of book pages into electronic form, which will then be used to create a multilingual digital library. The library can then be searched using the local language, using either spoken (using Speech Recognition) or written (using Online HWR) queries. Results can be viewed on screen, and also read out using Text-to-Speech conversion. In addition, an annotation system will allow students to make private annotations on the book. This solution can be used by individual libraries or to create district, state or national level online educational resources.
Language technologies used: OCR, Online HWR, Speech Recognition, Text-to-Speech for Indian Languages.
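Once OCR has turned page images into text, making the library searchable is largely an indexing problem. The sketch below assumes the OCR output is already available as plain Unicode text per page. The page identifiers, function names and whitespace tokenization are all illustrative assumptions; a real system for Indian languages would likely need script-aware tokenization and normalization.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

PageId = Tuple[str, int]  # (book identifier, page number); hypothetical layout


def build_index(pages: Dict[PageId, str]) -> Dict[str, Set[PageId]]:
    """Map each word to the set of pages whose OCR text contains it."""
    index: Dict[str, Set[PageId]] = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():
            index[word].add(page_id)
    return index


def search(index: Dict[str, Set[PageId]], query: str) -> Set[PageId]:
    """Return the pages that contain every word of the query."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Assumed OCR output for two pages of one book.
pages = {
    ("book1", 1): "कमल नमक",
    ("book1", 2): "नमक",
}
index = build_index(pages)
print(sorted(search(index, "नमक")))
print(sorted(search(index, "कमल नमक")))
```

The query string itself could come from any input channel, whether typed, handwritten and recognized by Online HWR, or spoken and transcribed, since the index only sees the resulting text.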
Automatic Forms Processing/Educational Testing
With millions of application forms filled in every year in Indian languages, especially in the education sector, a solution for automatically reading handwritten entries from scanned images of forms is clearly very valuable. As a result of a growing school-going population, manual evaluation of answer papers has also become very difficult. Offline HWR technology opens up the possibility of automatically reading and evaluating responses, at least for fill-in-the-blanks questions that have one or a few correct answers.
The proposed solution is a complete forms-processing system that can be used to read handwriting from a scanned image of a paper form. The interpreted results can be stored into a database (for applications) or compared with correct responses (for educational testing).
Language technologies used: Offline HWR for Indian Languages
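The comparison step for educational testing can be sketched as follows. This assumes the Offline HWR stage has already produced one text string per question. The function names and the answer-key layout are hypothetical. Unicode NFC normalization is applied because the same Indic text can be encoded with different code point sequences, and an exact string comparison would otherwise reject correct answers.

```python
import unicodedata
from typing import Dict, Set


def normalize_answer(text: str) -> str:
    """Collapse whitespace and apply Unicode NFC so equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", " ".join(text.split()))


def grade(recognized: Dict[str, str], answer_key: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Mark each question correct if the recognized text matches any accepted answer.

    recognized: question id -> text read from the scanned form (assumed HWR output).
    answer_key: question id -> set of accepted answers (one or a few per question).
    """
    results = {}
    for question_id, accepted in answer_key.items():
        response = normalize_answer(recognized.get(question_id, ""))
        results[question_id] = response in {normalize_answer(a) for a in accepted}
    return results


# Toy example: q2 accepts either the numeral or the word.
answer_key = {"q1": {"गंगा"}, "q2": {"4", "चार"}}
recognized = {"q1": " गंगा ", "q2": "5"}
print(grade(recognized, answer_key))
```

Exact matching after normalization is deliberately strict. A deployed system would probably also need a recognition confidence threshold below which a response is routed to a human evaluator, which this sketch does not attempt.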