Introduction
Some of the important language data resources required in Indian languages for various NLP applications are given below:
1. Parts-of-Speech Tagging:
It is the process of marking each word in a text as corresponding to a particular
part of speech on the basis of both its definition and its occurrence in a given context.
Its basic purpose is to support the design and creation of appropriate language technology.
Since each PoS tag is attached to a single word, preprocessing steps such as sentence splitting and tokenization
must already have been applied to the raw, typeset corpus. This preprocessing meets
the need for standardization across the Indian languages, which exhibit a very rich system of
morphology in which words tend to be long, with complex morpho-phonemic and morpho-syntactic changes at the junctures.
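The tokenize-then-tag flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the lexicon, the tag names (DT, NN, VB, etc.) and the two context rules are hypothetical and are not the scheme's TagSet, and English is used purely for readability. It shows a tag being chosen from a word's possible categories using the preceding tag as context.

```python
import re

# Hypothetical toy lexicon: each word form lists the tags it may take.
# Tag names are illustrative only, not the scheme's actual TagSet.
LEXICON = {
    "the": ["DT"],
    "a": ["DT"],
    "to": ["TO"],
    "i": ["PRP"],
    "read": ["VB"],
    "ticket": ["NN"],
    "book": ["NN", "VB"],  # ambiguous: noun or verb depending on context
}


def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Attach exactly one tag to each token, using the previous tag as context."""
    tagged = []
    prev = None
    for tok in tokens:
        candidates = LEXICON.get(tok.lower(), ["UNK"])
        if not tok[0].isalnum():
            choice = "PUNC"
        elif prev == "DT" and "NN" in candidates:
            choice = "NN"  # after a determiner, prefer the noun reading
        elif prev in ("TO", "PRP") and "VB" in candidates:
            choice = "VB"  # after "to" or a pronoun, prefer the verb reading
        else:
            choice = candidates[0]
        tagged.append((tok, choice))
        prev = choice
    return tagged


print(tag(tokenize("I read the book.")))
print(tag(tokenize("I book a ticket")))
```

The word "book" receives NN in the first sentence and VB in the second, which is the sense in which a tag depends on context as well as on the word's definition.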
Coverage of Languages:
The priority is to cover all the Scheduled languages first and then take up the non-scheduled
languages. The third phase of the work on PoS tagging covers the 22 Scheduled Languages: Assamese,
Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi,
Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu.
PoS Tagging Guidelines:
In order to develop TagSets for the individual languages, the scheme follows
the linguistic procedure laid down below:
Defining the traditional parts of speech along with examples
Understanding the concept of Form and Function (Pronouns, Demonstratives,
Numerals, etc.)
Recognizing the fuzzy boundaries between the grammatical classes, i.e., a lexical
item may function as one category in one context and as a different category in another
(Gerund vs. Infinitive/Participle, etc.).
Working out the syntactic relation between the modifier and the modified (Adj-Noun;
Participle-Noun).
Realizing the morpho-syntactic features that a particular lexical item carries
in a given syntactic configuration (Person-Number-Gender/Case; Tense-Aspect-Mood/Mod).
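One way to picture the last two guidelines is a token record that carries a category plus a bundle of morpho-syntactic features, with a check that a modifier and its head do not conflict. The sketch below is an assumption for illustration only: the class name, feature keys, tag labels and the transliterated Hindi forms are hypothetical choices, not part of the scheme.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedToken:
    """A word form with its category and morpho-syntactic features (hypothetical layout)."""
    form: str
    category: str  # illustrative tag name, e.g. "JJ" or "NN"
    features: dict = field(default_factory=dict)


def agrees(modifier, head, keys=("number", "gender", "case")):
    """Return True if modifier and head have no conflicting values for the shared features."""
    return all(
        modifier.features[k] == head.features[k]
        for k in keys
        if k in modifier.features and k in head.features
    )


# Transliterated Hindi examples: "acchaa" (good, masculine), "laRkaa" (boy), "laRkii" (girl).
acchaa = AnnotatedToken("acchaa", "JJ", {"number": "sg", "gender": "m", "case": "direct"})
laRkaa = AnnotatedToken("laRkaa", "NN", {"number": "sg", "gender": "m", "case": "direct"})
laRkii = AnnotatedToken("laRkii", "NN", {"number": "sg", "gender": "f", "case": "direct"})

print(agrees(acchaa, laRkaa))  # modifier and head share all feature values
print(agrees(acchaa, laRkii))  # gender values conflict
```

Keeping the features in a separate bundle, rather than folding them into the tag name, is a design choice made here for brevity; a real TagSet may encode them differently.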
2. Chunking:
Chunking is the process of annotating tagged tokens with structures in a non-hierarchical
and non-recursive way. Segmentation and labeling are among the most common
operations in language processing, and chunking is a representative segmentation process: it aims to
segment the tagged tokens into meaningful structures. Chunkers generally do not try to analyze
entire sentences; they only try to build “chunks” of words. As a result, the rule system of a chunker is
relatively simple, robust, and efficient.
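A flat, non-recursive chunker of the kind described above can be sketched in a few lines. The two rules, the chunk labels (NP, VP, O) and the tag names are hypothetical and serve only to illustrate the idea; they are not the scheme's Chunk Set.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat, non-overlapping chunks, scanning left to right."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        # Rule 1: noun chunk = optional determiner, any adjectives, one or more nouns.
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":
            k += 1
        if k > j:  # at least one noun was found
            chunks.append(("NP", [w for w, _ in tagged[i:k]]))
            i = k
            continue
        # Rule 2: verb chunk = a run of auxiliaries and verbs.
        j = i
        while j < n and tagged[j][1] in ("VAUX", "VB"):
            j += 1
        if j > i:
            chunks.append(("VP", [w for w, _ in tagged[i:j]]))
            i = j
            continue
        # No rule matched: leave the token outside any chunk.
        chunks.append(("O", [tagged[i][0]]))
        i += 1
    return chunks


sentence = [
    ("the", "DT"), ("old", "JJ"), ("man", "NN"),
    ("is", "VAUX"), ("reading", "VB"),
    ("a", "DT"), ("book", "NN"), (".", "PUNC"),
]
for label, words in chunk(sentence):
    print(label, words)
```

Each chunk is a plain list of words with a single label, so no chunk contains another chunk, and the sentence as a whole is never assigned a structure.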
Chunking Guidelines:
The scheme has adopted a set of linguistic norms to be followed by the Resource
Persons working on chunking. The chunking of a linguistic expression is based purely on specific categorial labels, and
hence the following linguistic guidelines are provided for the ease of annotators:
Identifying different chunk levels along with typical examples.
Keeping in mind that minimal non-recursive phrases (nominal or verbal) should be captured.
Understanding that chunking operates on minimal non-recursive phrases and that, within
such a minimal construction, there is no nested structure.
Making sure that nested non-recursive clusters are identified with their heads
(Possessive Constructions, Spatial Relational Nouns, Nested Modifiers inside noun phrases).
Having knowledge as well as hands-on experience of the linguistic phenomena that operate on the data
of the language concerned, such as scrambling of lexical items, dislocated elements, and the spelling out of
boundary elements, realized as case markers or tense, aspect, mood, etc., between two expressions.
PoS/Chunk Sets:
On the basis of the information sketched above, the scheme has developed
PoS Sets as well as Chunk Sets for all Indian languages, and the concerned resource persons
carry out their academic work according to these sets.