Linguistic Data Consortium for Indian Languages (LDC-IL)

Ministry of Education, Government of India

Released Datasets of LDC-IL and their Prices

LDC-IL has so far released more than 58 datasets. The released datasets are listed below along with their prices for commercial users.

Sl. no. | Name of dataset | Price
1 | Assamese Sentence Aligned Speech Corpus | 217004
2 | Bengali Sentence Aligned Speech Corpus | 437866
3 | Hindi Sentence Aligned Speech Corpus | 464357
4 | Kannada Sentence Aligned Speech Corpus | 697297
5 | Konkani Sentence Aligned Speech Corpus | 546368
6 | Maithili Sentence Aligned Speech Corpus | 279020
7 | Malayalam Sentence Aligned Speech Corpus | 816153
8 | Marathi Sentence Aligned Speech Corpus | 265318
9 | Nepali Sentence Aligned Speech Corpus | 298711
10 | Odia Sentence Aligned Speech Corpus | 441013
11 | Tamil Sentence Aligned Speech Corpus | 538655
12 | Urdu Sentence Aligned Speech Corpus | 328755
13 | Indian English-Bengali variant Sentence Aligned Speech Corpus | 48555
14 | Indian English-Kannada variant Sentence Aligned Speech Corpus | 61959
15 | Chhattisgarhi Raw Speech Corpus | 375592

These datasets are distributed for both commercial and non-commercial usage.

Please note that the datasets are free of charge for bona fide non-commercial and academic use. The requester must be a bona fide student, faculty member, or employee of a government-funded research institute, or a government entity.

Additional discounts are available for startups, MSMEs, and entities from SAARC countries. For more details about the discounts and the procedure for procuring the datasets, please log in to the Data Distribution portal and see the FAQ page.