LDC-IL

Corpora Creation in Indian Languages

1. Introduction

The Central Institute of Indian Languages holds corpora of around 3.5 million words in many major Indian languages. These will be enlarged to 25 million words in each language. The existing corpora are raw and will be cleaned before use. Apart from the 22 major Indian languages, there are hundreds of minor and tribal languages that deserve researchers' attention for analysis and interpretation. Creating corpora in these languages will help in comparing and contrasting the structure and functioning of Indian languages. Corpora will therefore be collected for at least 100 minor languages, amounting to around 3 to 5 million words in each language, depending on the availability of text.

2. Domain Specific Corpora

Apart from creating these basic text corpora, an attempt will be made to create domain-specific corpora in the following areas:

1. Newspaper corpora

2. Child language corpus

3. Pathological speech/language data

4. Speech error data

5. Historical/inscriptional databases of Indian languages, which are among the most important resources, both as living documents of Indian history and for tracing the historical linguistics of Indian languages.

6. Comparative, descriptive and reference grammars, which need to be considered as a corpus of databases.

7. Morphological analyzers and morphological generators.

3. POS tagged corpora

Part-of-speech (POS) tagged corpora are collections of texts in which the part-of-speech category of each word is marked.

POS tagged corpora will be developed by bootstrapping. As a first step, some amount of text will be tagged manually. A POS tagger that uses learning techniques will then be trained on this tagged data. After training, the tool will automatically tag another set of the raw corpus. The automatically tagged corpus will then be manually validated and used as additional training data to improve the tool's performance. This process will be repeated until the accuracy of the tool reaches a satisfactory level. With this approach, the man-hours required per 10,000 words will be higher at the start; thereafter, the tagging process will speed up.
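The bootstrapping cycle described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's tool: the learner is a toy most-frequent-tag tagger, and the tag names (DT, NN, VB), the `GOLD` lexicon that stands in for the human validator, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences, default_tag="NN"):
    """Learn the most frequent tag of each word from tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the default tag.
    return lambda words: [(w, model.get(w.lower(), default_tag)) for w in words]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the validated tag."""
    pairs = [(p[1], g[1]) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)

def bootstrap(seed, raw_batches, validate, target_accuracy=0.95):
    """Train, auto-tag a raw batch, validate it, add it to the training data, repeat."""
    training = list(seed)
    history = []
    for batch in raw_batches:
        tagger = train_unigram_tagger(training)
        auto_tagged = [tagger(words) for words in batch]
        validated = [validate(sentence) for sentence in auto_tagged]  # manual step
        history.append(accuracy(auto_tagged, validated))
        training.extend(validated)
        if history[-1] >= target_accuracy:
            break
    return train_unigram_tagger(training), history

# Hypothetical stand-in for the human validator: a small gold lexicon.
GOLD = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
        "barks": "VB", "sleeps": "VB"}

def validate(sentence):
    return [(word, GOLD[word.lower()]) for word, _ in sentence]

seed = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")]]   # manually tagged seed
raw_batches = [[["the", "cat", "sleeps"]],
               [["a", "dog", "sleeps"]],
               [["the", "cat", "barks"]]]

tagger, history = bootstrap(seed, raw_batches, validate)
print(history)  # per-round accuracy of the automatic tagging before validation
```

In this toy run the automatic tagging needs fewer corrections in later rounds, which is the effect the bootstrapping approach relies on. A real setup would replace the unigram learner with a stronger statistical tagger and measure accuracy on held-out data.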

4. Chunked corpora

The chunked corpora will be prepared in a manner similar to the POS tagged corpora. Here too, the initial training set will be produced entirely by hand; thereafter, it will be a combined man-machine effort. That is why the target for the first year is lower and doubles in the successive years. Chunked corpora are a useful resource for various applications.
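To make the notion of a chunked corpus concrete, the following Python sketch groups POS-tagged words into noun chunks (NP) and verb groups (VG) using two simple rules. It is an illustrative assumption only: the tag groups, the chunk labels and the `chunk` function are hypothetical, and a real chunker would follow the tagset and chunk definitions agreed for the project.

```python
# Toy tag groups (hypothetical); a real chunker would use the project's tagset.
NP_TAGS = {"DT", "JJ", "NN"}   # determiner, adjective, noun
VG_TAGS = {"MD", "VB"}         # modal, verb

def chunk(tagged_words):
    """Group (word, POS) pairs into labelled chunks using two simple rules."""
    chunks = []
    for word, tag in tagged_words:
        label = "NP" if tag in NP_TAGS else "VG" if tag in VG_TAGS else "O"
        continues = (
            chunks
            and label != "O"
            and chunks[-1][0] == label
            and tag != "DT"          # a determiner always opens a new noun chunk
        )
        if continues:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks

sentence = [("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("can", "MD"),
            ("chase", "VB"), ("a", "DT"), ("cat", "NN")]
for label, words in chunk(sentence):
    print(label, words)
```

Output of this kind, once manually validated, is what would serve as training data for the man-machine phase.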

5. Semantically tagged corpora

The real challenge in any NLP or text information processing application is the task of disambiguating word senses. In spite of long years of R&D in this area, fully automatic word sense disambiguation (WSD) with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is “semantically tagged corpora”.

In semantically tagged corpora, words in the text documents will be marked with their correct senses. For example, in
“Can a can can soup”

apart from POS tagging, it is also necessary to tag the text as
“Can a can <included-in-set:container> can <included-in-set:hold-action> soup”

which is an example of semantic tagging.

The question that arises is “What should be the set of such tags and where should these come from?” A natural answer is obtained when we look at the “WordNet”. WordNet is a semantic structure in which “relational semantics” is exploited to encode the senses of words. The basic machinery for sense representation is the grouping of synonyms into ‘synsets’ and the enumeration of semantic relations such as hypernymy and meronymy. For example, the ‘included-in-set’ tag above is the hypernymy (superordinate) relation, which disambiguates the sense.
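The tagging of the example sentence can be sketched as a lookup in a WordNet-like structure. The Python fragment below is a minimal illustration, assuming a tiny in-memory sense inventory: the `TOY_WORDNET` entries, the synset identifiers and the POS labels are hypothetical and are not taken from any actual WordNet. It also assumes that the POS tag alone selects the sense, which holds for this sentence but not in general.

```python
# Hypothetical sense inventory keyed by (lemma, POS); each sense records
# a made-up synset identifier and its hypernym (superordinate concept).
TOY_WORDNET = {
    ("can", "NOUN"): [{"synset": "can.n.01", "hypernym": "container"}],
    ("can", "VERB"): [{"synset": "can.v.01", "hypernym": "hold-action"}],
}

def sense_tag(tagged_tokens, wordnet=TOY_WORDNET):
    """Attach an <included-in-set:...> tag to words with exactly one candidate sense."""
    out = []
    for word, pos in tagged_tokens:
        senses = wordnet.get((word.lower(), pos), [])
        if len(senses) == 1:
            out.append(f"{word} <included-in-set:{senses[0]['hypernym']}>")
        else:
            # No entry, or several senses: left for a human annotator to decide.
            out.append(word)
    return " ".join(out)

sentence = [("Can", "AUX"), ("a", "DET"), ("can", "NOUN"),
            ("can", "VERB"), ("soup", "NOUN")]
tagged = sense_tag(sentence)
print(tagged)
```

When a word has several senses within the same part of speech, the lookup alone is not enough, and the context has to be taken into account by an annotator or a trained model.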

The following are the steps towards creating semantically tagged corpora:

Develop, refine and make widely available Indian language WordNets. (IITB is developing Hindi and Marathi WordNets; AU-KBC and Tanjavur University are working on Tamil WordNets. Similarly, other language WordNets are being created at other places.)

Link the WordNets into the “Indo-WordNet”, a massive semantic structure of Indian language WordNets.

Link the Indo-WordNet to the English and Euro-WordNets.

Create large amounts of sense tagged corpora manually for the purpose of training a ‘sense tagging machine’. The tags are the INDO-WORDNET SYNSET NUMBERS.

Devise algorithms for the training task. Hidden Markov Models, entropy maximization, etc. are possible candidates.

For the purpose of semi-automatic semantic tagging, invest in user-friendly and intelligent user interfaces.

The semantically tagged corpora are a valuable resource, which will be constructed using the Indian language WordNets and then employing machine learning algorithms (as in the case of the POS taggers discussed above).

6. Syntactic tree bank

Preparing this resource requires a higher level of linguistic expertise and more human effort. To prepare these corpora, experts will manually tag the data for syntactic parsing. A tool can then automatically extract various tree structures for the tree bank. Since it requires more manual effort and a higher degree of linguistic expertise, building this resource will be a relatively slow process. The initial take-off time will also be longer in this case.

A crucial point in this task is to arrive at a consensus regarding the tags, the degree of fineness of the analysis and the methodology to be followed. This calls for discussions among scholars from fields such as linguistics and computer science, which will be held through workshops and meetings. First, some Sanskrit scholars, linguists and computer scientists will review the existing tagging scheme developed for Indian languages by IIIT, Hyderabad and define standards for all Indian languages (extendable to any language). On this basis, experiments will be carried out on selected Indian languages to test the applicability and quality of the defined standards. After this testing, the actual tagging task will start.

7. Parallel aligned corpora

A text available in multiple languages through translation constitutes a parallel corpus. The National Book Trust and the Sahitya Akademi are among the official agencies that develop parallel texts in different languages through translation. These institutions have given the Central Institute of Indian Languages permission to create electronic versions of their works for use as parallel corpora. Magazines and newspaper houses that bring out translated versions of their output are another source of texts for parallel corpora. First, wherever necessary, the texts have to be keyed in, and then computer programmes have to be written for creating:

[I] Aligned texts; [II] Aligned sentences; and [III] Aligned chunks.
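As an illustration of sentence alignment, the Python sketch below pairs the sentences of a source text with those of its translation by dynamic programming over sentence lengths. It is a simplified, length-based approach chosen here for illustration, not the method of the project; the cost function, the penalty values and the romanized Hindi sentences are assumptions made for the example.

```python
# Allowed alignment "beads" (source count, target count) with hypothetical penalties:
# one-to-one is preferred, merges cost a little, unmatched sentences cost more.
PENALTY = {(1, 1): 0.0, (2, 1): 0.2, (1, 2): 0.2, (1, 0): 0.5, (0, 1): 0.5}

def bead_cost(src_len, tgt_len, move):
    """Relative length mismatch of the two sides plus the penalty of the bead type."""
    total = max(src_len + tgt_len, 1)
    return abs(src_len - tgt_len) / total + PENALTY[move]

def align_sentences(src, tgt):
    """Return a list of (source sentences, target sentences) beads of minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in PENALTY:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + bead_cost(src_len, tgt_len, (di, dj))
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end of both texts to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads

english = ["It is raining.", "We stayed at home and read a book."]
hindi = ["Barish ho rahi hai.", "Hum ghar par rahe.", "Humne ek kitab padhi."]

for source_side, target_side in align_sentences(english, hindi):
    print(source_side, "<->", target_side)
```

In this example the second English sentence is matched with two sentences of the translation, because their combined length fits it better than either one alone. Sentence length is only a rough signal; practical aligners usually combine it with bilingual dictionaries or other lexical evidence, and the aligned output still needs manual validation.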

8. Tools

Tools for Transfer Lexicon Grammar (including creation of interface for building Transfer Lexicon Grammar).

Spellchecker and corrector tools.

Tools for POS tagging (a trainable tagging tool with an interface for editing POS tagged corpora).

Tools for chunking (Rule-based language-independent chunkers).

Interface for chunking (Building an interface for editing and validating the chunked corpora).

Tools for the syntactic tree bank, including an interface for developing the syntactic tree bank.

Tools for semantic tagging, for which the basic resources are the Indian language WordNets: a browser with two windows, one showing the text to be tagged and the other showing the candidate senses (i.e., synsets) from the WordNet, after which the sense can be selected manually.

(Semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB).

Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora.