Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: Annotated, quality language data (both text and speech) and tools in Indian Languages to Individuals, Institutions and Industry for Research & Development, created in-house, through outsourcing and acquisition.
Size of Speech Corpora (as of Aug 2011)

SPEECH CORPORA (Raw Data)

Sl No. | Language                             | Speakers | Hours (hh:mm:ss)
1      | Assamese                             | 456      | 105:51:38
2      | Bengali                              | 472      | 138:18:47
3      | Bodo                                 | 433      | 201:10:48
4      | Dogri                                | 154      | 111:32:11
5      | Gujarati                             | 450      | 156:23:04
6      | Hindi                                | 450      | 163:25:47
7      | Indian English Bengali               | 52       | 34:12:57
8      | Indian English Gujarati (MP3 Format) | 52       | 21:40:00
9      | Indian English Kannada               | 54       | 37:01:33
10     | Kannada                              | 492      | 143:28:54
11     | Kashmiri                             | 150      | 44:59:07
12     | Konkani                              | 455      | 195:14:47
13     | Maithili                             | 156      | 43:33:42
14     | Malayalam                            | 314      | 105:47:05
15     | Manipuri                             | 457      | 107:10:27
16     | Marathi                              | 306      | 168:13:50
17     | Nepali                               | 485      | 145:04:46
18     | Oriya                                | 462      | 165:30:05
19     | Punjabi                              | 468      | 110:48:26
20     | Tamil                                | 453      | 213:37:27
21     | Telugu                               | 156      | 50:51:36
22     | Urdu                                 | 480      | 124:19:58
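
The durations in the raw-data table are written as hours:minutes:seconds, with the hours field running well past 24, so they cannot be read as clock times. A minimal Python sketch for converting such values to seconds and adding them up is given here. The helper names `to_seconds` and `to_hms` are illustrative and not part of any LDC-IL tool, and only two rows of the table are used as sample input.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' duration string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'hh:mm:ss' without wrapping at 24 hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Sample rows copied from the raw-data table: (language, speakers, duration).
rows = [
    ("Assamese", 456, "105:51:38"),
    ("Bengali", 472, "138:18:47"),
]

total = sum(to_seconds(duration) for _, _, duration in rows)
print(to_hms(total))  # 244:10:25
```

Extending `rows` to all 22 languages gives the overall size of the raw corpus in the same hh:mm:ss form.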



SPEECH CORPORA (Segmented Data)

Language        | Dialects                                | Female speakers | Male speakers | Total speakers | Female speech data (hh:mm:ss) | Male speech data (hh:mm:ss) | Total speech data (hh:mm:ss)
Assamese        | Upper Assam, Lower Assam                | 154             | 152           | 306            | 08:23:43                      | 17:35:43                    | 25:59:26
Bengali         | SCB (Kolkata) & Barendri (North Bengal) | 231             | 238           | 469            | 27:28:16                      | 29:21:12                    | 56:49:28
Bodo            | Standard and Non Standard               | 71              | 75            | 146            | 02:28:10                      | 05:18:46                    | 07:46:56
English Bengali | Standard and South Gujarati             | 27              | 26            | 53             |                               |                             |
English Kannada | Avadhi, Bhojpuri, Magahi and Standard   | 26              | 26            | 52             |                               |                             |
Gujarati        | Indian                                  | 27              | 38            | 65             | 02:07:27                      | 03:53:59                    | 06:01:26
Gujarati Mono   | Indian                                  | 125             | 110           | 235            | 00:25:33                      | 05:13:53                    | 05:39:26
Hindi           | Standard, Bhojpuri & Magahi             | 207             | 226           | 433            | 17:30:17                      | 20:22:55                    | 37:53:12
Kannada         | North-East, North-West and Canara       | 246             | 246           | 492            | 04:03:12                      | 05:11:27                    | 57:14:39
Konkani         | Standard                                | 57              | 61            | 118            | 06:48:41                      | 07:50:48                    | 14:39:29
Maithili        | Standard                                | 72              | 74            | 146            | 00:10:25                      | 01:55:59                    | 02:06:24
Malayalam       | Standard                                | 151             | 150           | 301            | 11:07:49                      | 22:16:00                    | 33:23:49
Manipuri        | Standard and Kakching                   | 229             | 221           | 450            | 03:28:31                      | 07:21:29                    | 10:50:00
Marathi         | Standard                                | 75              | 75            | 150            | 07:12:09                      | 07:09:29                    | 14:21:38
Nepali          | Darjeeling and Assamese                 | 99              | 97            | 196            | 09:40:49                      | 09:17:42                    | 18:58:31
Oriya           | Standard                                | 169             | 171           | 340            | 09:11:12                      | 11:24:43                    | 20:35:55
Punjabi         | Standard                                | 78              | 78            | 156            | 05:00:24                      | 06:10:44                    | 11:11:08
Tamil           | Standard                                | 64              | 86            | 150            | 15:08:57                      | 17:27:55                    | 32:36:52
Telugu          | Standard                                | 13              | 43            | 56             | 00:13:28                      | 00:53:13                    | 01:06:41
Urdu            | Standard                                | 169             | 168           | 337            | 21:34:54                      | 29:57:05                    | 51:31:59
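
If the total speech data column is meant to be the sum of the female and male columns (the page does not say so explicitly, so this is an assumption), each row of the segmented-data table can be checked mechanically. The following Python sketch runs that check on three rows copied from the table; the function names `to_seconds` and `is_consistent` are hypothetical helpers introduced only for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' duration string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def is_consistent(female: str, male: str, total: str) -> bool:
    """Return True when female + male durations equal the listed total."""
    return to_seconds(female) + to_seconds(male) == to_seconds(total)


# (female, male, total) durations as listed in the segmented-data table.
rows = {
    "Assamese": ("08:23:43", "17:35:43", "25:59:26"),
    "Hindi": ("17:30:17", "20:22:55", "37:53:12"),
    "Kannada": ("04:03:12", "05:11:27", "57:14:39"),
}

# Collect the languages whose listed total differs from female + male.
mismatches = [lang for lang, cols in rows.items() if not is_consistent(*cols)]
print(mismatches)  # ['Kannada']
```

On these rows only the Kannada entry fails the check: the female and male figures add up to 09:14:39, while the table lists 57:14:39. Which of the three values is at fault cannot be determined from the page itself.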



SPEECH CORPORA (Annotated Data)

Sl. No. | Language  | Validated annotated speech data (hh:mm:ss)
1       | Bengali   | 04:33:37
2       | Hindi     | 01:01:28
3       | Konkani   | 02:25:00
4       | Kannada   | 01:00:00
5       | Oriya     | 00:58:28
6       | Malayalam | 01:00:00
7       | Punjabi   | 04:07:26
8       | Tamil     | 01:00:00


Developed & Maintained by:
LDC-IL, CIIL
Copyright © LDC-IL,
Central Institute of Indian Languages
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)