POS Tagged Corpus
We have developed an automatic POS tagger for Indian languages using a hybrid approach. Its precision is currently 86.2% (LDC-IL tagset 84.2%, BIS tagset 88.2%), and it is expected to improve after further rounds of fine-tuning.
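
As a quick consistency check on the figures above, the overall precision coincides with the unweighted mean of the two per-tagset figures. The sketch below only reproduces that arithmetic; the assumption that the overall figure was derived as a simple average is ours and is not stated by the project, and the names `precision_by_tagset` and `unweighted_mean` are illustrative.

```python
# Per-tagset precision figures quoted in the text, in percent.
precision_by_tagset = {"LDC-IL": 84.2, "BIS": 88.2}


def unweighted_mean(values):
    """Return the simple (unweighted) arithmetic mean of the given numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: the overall figure is a plain average of the two tagsets,
# not a mean weighted by the number of words evaluated under each tagset.
overall = round(unweighted_mean(precision_by_tagset.values()), 1)
print(f"Overall precision: {overall}%")
```

A mean weighted by the number of evaluated words per tagset would generally give a different value, so this match suggests, but does not prove, that a simple average was used.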

The following table shows the number of words annotated as per the LDC-IL POS tagset.
Words tagged as per LDC-IL POS tagset
Sl.No. | Language | 2008-09 | 2009-10 | 2010-11 | Total Words tagged
1 | Assamese | Tag set creation | 30,000+ | ~50,000 | 85390
2 | Bengali | Tag set creation | 25,000+ | ~50,000 | 75397
3 | Bodo | Tag set creation | 30,000+ | ~50,000 | 83453
4 | Gujarati | Tag set creation | 30,000+ | ~50,000 | 83435
5 | Hindi | Tag set creation | 30,000+ | ~50,000 | 84962
6 | Malayalam | Tag set creation | 30,000+ | ~50,000 | 82897
7 | Manipuri | Tag set creation | 30,000+ | ~50,000 | 83439
8 | Nepali | Tag set creation | 29,000+ | ~50,000 | 86616
9 | Oriya | Tag set creation | 30,000+ | ~50,000 | 79159
10 | Punjabi | Tag set creation | 28,000+ | ~50,000 | 78053
11 | Tamil | Tag set creation | 30,000+ | ~50,000 | 88086
12 | Urdu | Tag set creation | 26,000+ | ~50,000 | 76996

The following table shows the number of words annotated as per the BIS POS tagset.
Words tagged as per Bureau of Indian Standards (BIS) POS tagset
Sl.No. | Language | 2012-13: Validated | 2012-13: To be validated | 2012-13: Total Words tagged
1 | Assamese | 27810 | - | ~27,000
2 | Bengali | 133512 | 107229 | ~2,40,000
3 | Bodo | 112891 | 127230 | ~2,40,000
4 | Gujarati | 169866 | 54116 | ~2,23,000
5 | Hindi | 103194 | 124786 | ~2,27,000
6 | Kannada | 101324 | 123608 | ~2,24,000
7 | Maithili | 55661 | | ~55,000
8 | Malayalam | 129717 | 221553 | ~3,51,000
9 | Manipuri | 101782 | - | ~1,01,000
10 | Odia | 78931 | 29380 | ~1,08,000
11 | Punjabi | 132577 | 112059 | ~2,44,000
12 | Tamil | 104573 | 111937 | ~2,16,000
13 | Telugu | 38045 | - | ~38,000
14 | Urdu | 77329 | 121455 | ~1,98,000
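
The approximate totals in the BIS table appear to be the sum of the validated and to-be-validated counts, rounded down to the nearest thousand and written with Indian digit grouping. The following minimal sketch reproduces that arithmetic from the table's own figures. The rounding rule is inferred from the numbers rather than stated by the project, and the names `bis_counts`, `approx_total`, and `indian_grouping` are illustrative.

```python
# (validated, to_be_validated) word counts copied from the BIS table.
# None marks a cell that is "-" or empty in the source.
bis_counts = {
    "Assamese": (27810, None),
    "Bengali": (133512, 107229),
    "Bodo": (112891, 127230),
    "Gujarati": (169866, 54116),
    "Hindi": (103194, 124786),
    "Kannada": (101324, 123608),
    "Maithili": (55661, None),
    "Malayalam": (129717, 221553),
    "Manipuri": (101782, None),
    "Odia": (78931, 29380),
    "Punjabi": (132577, 112059),
    "Tamil": (104573, 111937),
    "Telugu": (38045, None),
    "Urdu": (77329, 121455),
}


def approx_total(validated, pending):
    """Sum both counts and round down to the nearest thousand (inferred rule)."""
    total = validated + (pending or 0)
    return total // 1000 * 1000


def indian_grouping(n):
    """Format a non-negative integer with Indian grouping, e.g. 240000 -> '2,40,000'."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # After the last three digits, the remaining digits are grouped in pairs.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


for language, (validated, pending) in bis_counts.items():
    print(f"{language}: ~{indian_grouping(approx_total(validated, pending))}")
```

Keeping the raw counts separate from the formatted totals makes it straightforward to recompute the "Total Words tagged" column as more text is validated.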
INDIAN SIGN LANGUAGE (ISL)
The ISL corpus was collected by LDC-IL in the recording studio of CIIL. Segmentation and annotation of this corpus are in progress. The corpus consists of the following categories:
S.No | Category
1 | Short stories: Thirsty crow, rabbit and tortoise
2 | Frequent words
3 | Question and answering
An ISL corpus has also been collected by RKMVU, Coimbatore. It consists of basic vocabulary, self-introductions, and information about family members, friends, activities and hobbies, food habits, travel, etc.