LDC-IL

Current Status

POS Tagged Corpus

We have developed an automatic POS tagger for Indian languages using a hybrid approach. Its precision is currently 86.2% (LDC-IL tagset 84.2%, BIS tagset 88.2%), and it is expected to improve after further rounds of fine-tuning.

The following table shows the number of words annotated as per the LDC-IL POS tagset.

Words tagged as per LDC-IL POS tagset
Sl. No.	Language	2008-09	2009-10	2010-11	Total words tagged
1	Assamese	Tag set creation	30,000+	~50,000	85,390
2	Bengali	Tag set creation	25,000+	~50,000	75,397
3	Bodo	Tag set creation	30,000+	~50,000	83,453
4	Gujarati	Tag set creation	30,000+	~50,000	83,435
5	Hindi	Tag set creation	30,000+	~50,000	84,962
6	Malayalam	Tag set creation	30,000+	~50,000	82,897
7	Manipuri	Tag set creation	30,000+	~50,000	83,439
8	Nepali	Tag set creation	29,000+	~50,000	86,616
9	Oriya	Tag set creation	30,000+	~50,000	79,159
10	Punjabi	Tag set creation	28,000+	~50,000	78,053
11	Tamil	Tag set creation	30,000+	~50,000	88,086
12	Urdu	Tag set creation	26,000+	~50,000	76,996

The following table shows the number of words annotated as per the BIS POS tagset.

Words tagged as per Bureau of Indian Standards (BIS) POS tagset
Sl. No.	Language	Previous data (before 31.3.2013)	Current data (1.4.2013 onwards)	Total
1	Assamese	55066	52185	107251
2	Bengali	212426	67450	279876
3	Bodo	133887	115136	249023
4	Gujarati	265789	324305	590094
5	Hindi	233347	410406	643753
6	Kannada	228181	289229	517410
7	Kashmiri	99906	0	99906
8	Konkani	0	106016	106016
9	Maithili	62750	172347	235097
10	Malayalam	392689	762870	1155559
11	Manipuri	102116	284724	386840
12	Nepali	0	189159	189159
13	Odia	159474	341583	501057
14	Punjabi	251250	680994	932244
15	Tamil	174488	1202369	1376857
16	Urdu	184791	480461	665252
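
The Total column of the BIS table can be checked against the two period columns. The following is a minimal sketch in Python, assuming the rows are transcribed by hand from the table above; the names ROWS, inconsistent_rows and grand_total are illustrative and are not part of any LDC-IL tool.

```python
# Rows transcribed from the BIS table above:
# (language, words before 31.3.2013, words from 1.4.2013 onwards, stated total).
ROWS = [
    ("Assamese", 55066, 52185, 107251),
    ("Bengali", 212426, 67450, 279876),
    ("Bodo", 133887, 115136, 249023),
    ("Gujarati", 265789, 324305, 590094),
    ("Hindi", 233347, 410406, 643753),
    ("Kannada", 228181, 289229, 517410),
    ("Kashmiri", 99906, 0, 99906),
    ("Konkani", 0, 106016, 106016),
    ("Maithili", 62750, 172347, 235097),
    ("Malayalam", 392689, 762870, 1155559),
    ("Manipuri", 102116, 284724, 386840),
    ("Nepali", 0, 189159, 189159),
    ("Odia", 159474, 341583, 501057),
    ("Punjabi", 251250, 680994, 932244),
    ("Tamil", 174488, 1202369, 1376857),
    ("Urdu", 184791, 480461, 665252),
]


def inconsistent_rows(rows):
    """Return the languages whose stated total differs from previous + current."""
    return [lang for lang, prev, cur, total in rows if prev + cur != total]


def grand_total(rows):
    """Sum the stated totals over all languages."""
    return sum(total for _, _, _, total in rows)


if __name__ == "__main__":
    print(inconsistent_rows(ROWS))  # prints [] when every row adds up
    print(grand_total(ROWS))
```

An empty list from inconsistent_rows means that each stated total equals the sum of the two period columns for that language.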

Words tagged as per LDC-IL LEX tagset
Sl. No.	Language	Words
Major Scheduled Languages
1	Assamese	7,253
2	Bengali	8,338
3	Gujarati	11,552
4	Hindi	3,973
5	Malayalam	6,088
6	Marathi	7,645
7	Nepali	7,719
8	Oriya	5,783
9	Punjabi	7,818
10	Tamil	15,742
11	Urdu	4,955
Minor Scheduled Languages
12	Manipuri	2,754
13	Dogri	9,257
14	Konkani	9,190
15	Bodo	6,822

POS & Chunking
Task	Languages
Prepared LDC-IL tagset, version 0.4	Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu
Prepared Chunk tagset	Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu
Completed pilot chunking of 200 sentences	Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu

INDIAN SIGN LANGUAGE (ISL)

The ISL corpus was collected by LDC-IL in the recording studio of CIIL. Segmentation and annotation of this corpus are in progress. The corpus consists of the following categories:

Sl. No.	Category
1	Short stories: Thirsty Crow, Rabbit and Tortoise
2	Frequent words
3	Questions and answers

An ISL corpus has also been collected by RKMVU, Coimbatore. It covers basic vocabulary, self-introduction, information about family members and friends, activities and hobbies, food habits, travel, etc.