Some of the important language data resources required for various NLP applications in Indian languages are given below:
2. Electronic dictionaries
Electronic dictionaries are a primary prerequisite for developing any NLP software.
ED 1 Monolingual/bilingual dictionaries: 25,000 words per year (per language)
ED 2 Transfer Lexicon and Grammar (TransLexGram) (per language)
The Transfer Lexicon and Grammar (ED 2) involves developing a language resource that would contain:
- English Headwords
- Their grammatical category
- Their various senses in Hindi
- Corresponding sense in the other Indian language
- An example sentence in English for each sense of a word
- Corresponding translation in the concerned Indian language
- In the case of verbs, parallel verb frames from English to the Indian language
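The fields listed above could be modelled as a single lexicon entry along the following lines. This is only an illustrative sketch: the class and field names (`TransLexGramEntry`, `Sense`, `VerbFrame`) are hypothetical, and the romanized Hindi and Marathi sample data are supplied here for illustration and are not taken from the resource itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VerbFrame:
    """Parallel verb frame (hypothetical layout, for illustration only)."""
    english_frame: str                # argument pattern of the English verb
    target_vibhaktis: Dict[str, str]  # argument role -> vibhakti in the Indian language


@dataclass
class Sense:
    """One sense of an English headword."""
    hindi_sense: str         # the sense expressed in Hindi
    target_sense: str        # corresponding sense in the other Indian language
    english_example: str     # example sentence in English
    target_translation: str  # its translation in the Indian language
    verb_frame: Optional[VerbFrame] = None  # filled in only for verbs


@dataclass
class TransLexGramEntry:
    headword: str         # English headword
    category: str         # grammatical category
    target_language: str  # the Indian language this entry is paired with
    senses: List[Sense] = field(default_factory=list)

    def verb_frames(self) -> List[VerbFrame]:
        """Return the verb frames of all senses that have one."""
        return [s.verb_frame for s in self.senses if s.verb_frame is not None]


# Illustrative sample entry (assumed data, romanized).
entry = TransLexGramEntry(
    headword="eat",
    category="verb",
    target_language="Marathi",
    senses=[
        Sense(
            hindi_sense="khaanaa",
            target_sense="khaane",
            english_example="Ram eats an apple.",
            target_translation="Ram safarchand khato.",
            verb_frame=VerbFrame(
                english_frame="NP[subject] V NP[object]",
                # "0" stands for a null (unmarked) vibhakti in this sketch.
                target_vibhaktis={"subject": "0", "object": "0"},
            ),
        )
    ],
)
```

Keeping the verb frame optional on each sense, rather than on the entry, reflects the point that vibhaktis are tied to specific senses of a verb.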
TransLexGram will thus be a rich lexicon that contains not only word-level information but also the crucial information of verb-argument structure and the vibhaktis (case markers or postpositions) that the various languages use with specific senses of a verb.
The resource, once created, will be a parallel resource not only between English and the Indian languages but also across all Indian languages.
Creating the bilingual TransLexGram resources as aligned resources would bring several advantages. It would also reduce the work to be done for each individual resource.
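One way to picture how aligned bilingual resources yield a resource across Indian languages: if every bilingual lexicon identifies a sense by the same pair of English headword and Hindi sense, any two lexicons can be joined on that pair. The sketch below assumes this keying scheme; the `pivot_align` function, the type aliases, and the romanized sample words are hypothetical and serve only as illustration.

```python
from typing import Dict, Tuple

# Assumed shared key: (English headword, Hindi sense).
SenseKey = Tuple[str, str]
# A bilingual lexicon maps each sense key to a word in one Indian language.
Lexicon = Dict[SenseKey, str]


def pivot_align(lex_a: Lexicon, lex_b: Lexicon) -> Dict[str, str]:
    """Pair words of language A with words of language B that share a sense key."""
    shared_keys = lex_a.keys() & lex_b.keys()
    return {lex_a[key]: lex_b[key] for key in shared_keys}


# Illustrative sample data (assumed, romanized).
marathi: Lexicon = {
    ("eat", "khaanaa"): "khaane",
    ("water", "paanii"): "paanii",
}
telugu: Lexicon = {
    ("eat", "khaanaa"): "tinu",
}

marathi_to_telugu = pivot_align(marathi, telugu)
print(marathi_to_telugu)  # {'khaane': 'tinu'}
```

Only senses present in both lexicons are paired, so coverage across two Indian languages grows as each English-aligned lexicon grows, without a separate Marathi-Telugu effort.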