Central Institute of Indian Languages [CIIL]
MISSION STATEMENT: To provide annotated, quality language data (both text and speech) and tools in Indian languages to individuals, institutions and industry for research and development, created in-house, through outsourcing and acquisition.
Size of Speech Corpora (as of Dec 2011)

SPEECH CORPORA (Raw Data)

Sl No.  Language                              Hours (HH:MM:SS)
1       Assamese                              105:52:37
2       Bengali                               138:18:47
3       Bodo                                  114:38:55
4       Dogri                                 58:12:49
5       Gujarati                              146:23:04
6       Hindi                                 163:25:47
7       Indian English Bengali                34:12:57
8       Indian English Gujarati (MP3 Format)  21:40:00
9       Indian English Kannada                37:01:33
10      Kannada                               137:53:28
11      Kashmiri                              44:59:07
12      Konkani                               205:01:48
13      Maithili                              43:33:42
14      Malayalam                             105:47:05
15      Manipuri                              107:10:30
16      Marathi                               168:13:50
17      Nepali                                145:04:46
18      Oriya                                 45:10:25
19      Punjabi                               71:55:56
20      Tamil                                 87:03:24
21      Telugu                                50:51:36
22      Urdu                                  81:06:25
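
The durations above are written as HH:MM:SS with an hours field that can exceed 24, so ordinary clock-time parsers will generally reject them. The following is a minimal sketch, not part of the original page, of how such values could be converted to seconds and summed. The function names parse_duration and format_duration are hypothetical, and only three rows of the table are used for illustration.

```python
def parse_duration(text: str) -> int:
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid duration: {text!r}")
    return hours * 3600 + minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Convert seconds back to 'H:MM:SS' without wrapping at 24 hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Three rows copied from the raw-data table (illustrative subset only).
raw_data = {
    "Assamese": "105:52:37",
    "Dogri": "58:12:49",
    "Telugu": "50:51:36",
}

total_seconds = sum(parse_duration(value) for value in raw_data.values())
print(format_duration(total_seconds))  # 214:57:02
```

Working in integer seconds keeps the sums exact; converting to decimal hours is best left to the final reporting step.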



SPEECH CORPORA (Segmented Data)

Language                Dialects                                 Female speakers  Male speakers  Total speakers  Total speech data (HH:MM:SS)
Assamese                Upper Assam, Lower Assam                 154              152            306             80:08:04
Bengali                 SCB (Kolkata) & Barendri (North Bengal)  231              238            469             125:19:53
Bodo                    Standard and Non Standard                71               75             146             07:46:56
Indian English Bengali  Indian                                   27               26             53              26:56:45
Indian English Kannada  Indian                                   27               26             53              16:52:24
Gujarati                Standard and South Gujarati              27               38             65              06:01:26
Hindi                   Standard, Bhojpuri & Magahi              206              227            433             105:26:45
Kannada                 North-East, North-West and Canara        246              246            492             137:10:37
Konkani                 Standard                                 54               53             107             43:01:36
Maithili                Standard                                 72               74             146             02:06:24
Malayalam               Standard                                 81               80             161             63:56:45
Manipuri                Standard and Kakching                    115              112            227             36:33:28
Marathi                 Standard                                 75               75             150             58:57:50
Nepali                  Darjeeling and Assamese                  99               97             196             44:48:43
Oriya                   Standard                                 80               82             162             37:38:48
Punjabi                 Standard                                 78               78             156             29:38:25
Tamil                   Standard                                 64               86             150             74:11:58
Telugu                  Standard                                 13               43             56              01:06:41
Urdu                    Standard                                 85               84             169             40:01:04
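
Each row of the segmented table gives female, male and total speaker counts together with a total duration, so two simple checks are possible: that the counts add up, and how much speech there is per speaker on average. The sketch below is an illustration only and not part of the original page. The SegmentedRow class and the helper names are hypothetical, and three rows are copied from the table as sample input.

```python
from dataclasses import dataclass


@dataclass
class SegmentedRow:
    language: str
    female: int
    male: int
    total: int
    duration: str  # 'HH:MM:SS', hours may exceed 24

    def seconds(self) -> int:
        """Total speech duration in seconds."""
        hours, minutes, secs = (int(part) for part in self.duration.split(":"))
        return hours * 3600 + minutes * 60 + secs


def is_consistent(row: SegmentedRow) -> bool:
    """True when female + male equals the reported total."""
    return row.female + row.male == row.total


def minutes_per_speaker(row: SegmentedRow) -> float:
    """Average minutes of segmented speech per speaker."""
    return row.seconds() / 60 / row.total


# Three rows copied from the segmented-data table (illustrative subset only).
rows = [
    SegmentedRow("Assamese", 154, 152, 306, "80:08:04"),
    SegmentedRow("Tamil", 64, 86, 150, "74:11:58"),
    SegmentedRow("Telugu", 13, 43, 56, "01:06:41"),
]

for row in rows:
    status = "ok" if is_consistent(row) else "mismatch"
    print(f"{row.language}: {status}, {minutes_per_speaker(row):.1f} min/speaker")
```

A per-speaker average is only a rough indicator, since the page does not say how evenly the recordings are distributed across speakers.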



SPEECH CORPORA (Annotated Data)

Sl. No.  Name of the Language  Validated Speech Annotated Data (HH:MM:SS)
1.       Bengali               04:33:37
2.       Hindi                 01:01:28
3.       Konkani               02:25:00
4.       Kannada               01:00:00
5.       Oriya                 00:58:28
6.       Malayalam             01:00:00
7.       Punjabi               04:07:26
8.       Tamil                 01:00:00


Developed & Maintained by:
LDC-IL, CIIL
Copyright © LDC-IL,
Central Institute of Indian Languages
Central Institute of Indian Languages
Department of Higher Education
Ministry of Human Resource Development
Government of India
Manasagangothri, Hunsur Road, Mysore-570006, Karnataka, India.
Tel: (0821) 2515820 (Director)
Reception/PABX : (0821) 2345000
Fax: (0821) 2515032 (Off)