Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: Annotated, quality language data (both text & speech) and tools in Indian Languages to Individuals, Institutions and Industry for Research & Development - Created in-house, through outsourcing and acquisition.
Current Status

POS Tagged Corpus

We have developed an automatic POS tagger for Indian languages using a hybrid approach. Its precision at present is 86.2% (LDC-IL tagset 84.2%, BIS tagset 88.2%), and it is expected to rise after more rounds of fine-tuning.
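The overall figure quoted above is consistent with a plain average of the two per-tagset figures. The short sketch below only checks that arithmetic. The page does not say how the overall figure was derived, so the unweighted mean is an assumption, and the names `precision_by_tagset` and `unweighted_mean` are illustrative.

```python
# Per-tagset precision figures quoted in the text (percent).
precision_by_tagset = {"LDC-IL": 84.2, "BIS": 88.2}


def unweighted_mean(values):
    """Return the plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: the overall figure is an unweighted mean of the two tagsets.
overall = round(unweighted_mean(precision_by_tagset.values()), 1)
print(f"Overall precision (unweighted mean): {overall}%")
```

If the two tagsets had been evaluated on test sets of different sizes, a weighted mean would be the more natural choice and could give a slightly different figure.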

 The following table shows the number of words annotated as per the LDC-IL POS tagset.

Words tagged as per LDC-IL POS tagset

Sl.No.  Language   2008-09           2009-10   2010-11   Total words tagged
------  ---------  ----------------  --------  --------  ------------------
1       Assamese   Tag set creation  30,000+   ~50,000   85390
2       Bengali    Tag set creation  25,000+   ~50,000   75397
3       Bodo       Tag set creation  30,000+   ~50,000   83453
4       Gujarati   Tag set creation  30,000+   ~50,000   83435
5       Hindi      Tag set creation  30,000+   ~50,000   84962
6       Malayalam  Tag set creation  30,000+   ~50,000   82897
7       Manipuri   Tag set creation  30,000+   ~50,000   83439
8       Nepali     Tag set creation  29,000+   ~50,000   86616
9       Oriya      Tag set creation  30,000+   ~50,000   79159
10      Punjabi    Tag set creation  28,000+   ~50,000   78053
11      Tamil      Tag set creation  30,000+   ~50,000   88086
12      Urdu       Tag set creation  26,000+   ~50,000   76996


The following table shows the number of words annotated as per the BIS POS tagset.

Words tagged as per Bureau of Indian Standards (BIS) POS tagset

Sl.No.  Language   Validated (2012-13)  To be validated (2012-13)  Total words tagged
------  ---------  -------------------  -------------------------  ------------------
1       Assamese   27810                -                          ~27,000
2       Bengali    133512               107229                     ~2,40,000
3       Bodo       112891               127230                     ~2,40,000
4       Gujarati   169866               54116                      ~2,23,000
5       Hindi      103194               124786                     ~2,27,000
6       Kannada    101324               123608                     ~2,24,000
7       Maithili   55661                -                          ~55,000
8       Malayalam  129717               221553                     ~3,51,000
9       Manipuri   101782               -                          ~1,01,000
10      Odia       78931                29380                      ~1,08,000
11      Punjabi    132577               112059                     ~2,44,000
12      Tamil      104573               111937                     ~2,16,000
13      Telugu     38045                -                          ~38,000
14      Urdu       77329                121455                     ~1,98,000
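The approximate totals in the BIS table appear to be the validated and to-be-validated counts added together and rounded down to the nearest thousand. The sketch below reproduces that arithmetic for a few rows. The rounding rule is inferred from the numbers, not stated on the page, and the names `bis_counts` and `total_words` are illustrative. A "-" entry is treated as zero.

```python
# (validated, to_be_validated) word counts copied from the BIS table.
# A "-" entry in the table is represented here as 0 (an assumption).
bis_counts = {
    "Assamese": (27810, 0),
    "Bengali": (133512, 107229),
    "Gujarati": (169866, 54116),
    "Malayalam": (129717, 221553),
    "Urdu": (77329, 121455),
}


def total_words(validated, to_be_validated):
    """Return the exact number of tagged words for one language."""
    return validated + to_be_validated


totals = {lang: total_words(*counts) for lang, counts in bis_counts.items()}

for lang, total in totals.items():
    # Round down to the nearest thousand, matching the "~" figures.
    approx = (total // 1000) * 1000
    print(f"{lang}: exact {total}, approx ~{approx}")
```

For Bengali this gives an exact total of 240741, which matches the table's ~2,40,000 (Indian digit grouping).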



INDIAN SIGN LANGUAGE (ISL)

The ISL corpus has been collected by LDC-IL in the recording studio of CIIL. Segmentation and annotation of this corpus are in progress. The corpus consists of the following categories:

S.No  Category
----  ------------------------------------------------
1     Short stories: Thirsty crow, rabbit and tortoise
2     Frequent words
3     Question and answering

An ISL corpus has also been collected by RKMVU, Coimbatore. It covers basic vocabulary, self-introduction, information about family members, friends, activities/hobbies, food habits, travel, etc.



Developed & Maintained by:
LDC-IL, CIIL
Copyright © LDC-IL,
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)