Corpora and Technological Resources for Assamese | Official Website of Linguistic Data Consortium for Indian Languages

Maintained by: Syeda Mustafiza Tamim
Last Updated on: 17/01/2025


Introduction

Language resources are essential for effective language study and analysis. Developing high-quality data requires significant time and effort. Innovations like writing systems, the printing press, and computers have aided in storing language information. A comprehensive listing of authentic language data and tools will be beneficial for advancing language studies. These resources are crucial for understanding the evolution and growth of a particular language, especially in enhancing research and improving language technology.

Resource Development Challenges

Language technology offers a profitable and socially beneficial way to overcome language barriers. However, digitizing language data poses challenges, particularly when developing accurate language models for complex languages like Assamese. The first step in enabling computers to process human language is character encoding. The Unicode standard handles this by assigning a unique code point to each Assamese character.
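As a rough illustration of the encoding step, the sketch below lists the Unicode code points behind a short Assamese string. It assumes Python's standard `unicodedata` module. The helper names `code_points` and `in_bengali_block` and the sample word are illustrative choices, not part of any LDC-IL tool. Unicode encodes Assamese in the Bengali block (U+0980 to U+09FF), which is why the character names begin with "BENGALI".

```python
import unicodedata

# Unicode encodes Assamese in the Bengali block, U+0980 to U+09FF.
BENGALI_BLOCK = range(0x0980, 0x0A00)

def code_points(text):
    """Return the code point of each character, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

def in_bengali_block(ch):
    """Check whether a single character falls inside the Bengali block."""
    return ord(ch) in BENGALI_BLOCK

# Illustrative sample: the word for "Assam", written with escapes so the
# exact code points are unambiguous.
sample = "\u0985\u09B8\u09AE"

# U+09F0 and U+09F1 are the two letters used for Assamese but not Bengali.
assamese_only = "\u09F0\u09F1"

for ch in sample + assamese_only:
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))

# Each of these characters takes three bytes in UTF-8.
print(len(sample.encode("utf-8")), "bytes in UTF-8")
```

A check such as `all(in_bengali_block(ch) for ch in token)` can serve as a coarse filter for stray Latin or Devanagari tokens when cleaning a corpus, although real text also contains digits, punctuation, and spaces that need separate handling.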

Language Resources

Numerous materials exist for analyzing Assamese data and developing technologies. These resources include text and speech corpora, dictionaries, ontologies, and multimedia databases, alongside software for collection, preparation, and analysis. A linguistic corpus, which represents real-world language usage, is vital for developing various language technologies.

The main application areas of language technology include spell and grammar checking, speech recognition and synthesis, machine translation, and information retrieval. Language resources are foundational for these tools, enhancing communication and interaction between humans and computers.

Prominent Institutions and Initiatives

Linguistic Data Consortium for Indian Languages (LDC-IL)

LDC-IL, housed within the Central Institute of Indian Languages, has developed a comprehensive Assamese text and speech corpus from various sources, including books and newspapers. The data covers multiple domains, with significant contributions to language processing.

AI4Bharat

AI4Bharat, a research lab at IIT Madras, is committed to enhancing AI technology for Indian languages through open-source initiatives. The lab has developed and released an extensive set of datasets, tools, and cutting-edge models. Its focus areas are transliteration, natural language understanding and generation, translation, automatic speech recognition, and speech synthesis.

Indian Languages Corpora Initiative (ILCI)

The Indian Languages Corpora Initiative (ILCI), launched by TDIL, is a significant effort to create national corpora based on standardized frameworks. In Phase 1, the initiative developed parallel annotated corpora in 12 major languages of India, including English, using India's national standards for part-of-speech (POS) annotation. Following Phase 2, the total size of the corpora is estimated at around 27 million parallel annotated and chunked words, covering key domains such as Health and Tourism (HT) and Agriculture and Entertainment (AGENT).

Technology Development for Indian Languages (TDIL)

Initiated by the Ministry of Electronics & Information Technology, TDIL focuses on developing multilingual knowledge resources and language technology. This includes balanced corpora and machine translation systems, benefiting numerous Indian languages, including Assamese.

OPUS

A growing collection of translated texts, aligned as a parallel corpus for easy access.

Open SLR

Provides high-quality Assamese multi-speaker speech datasets, facilitating language research.

Gauhati University

The Assamese Corpus was developed in the NLP Lab of Gauhati University. Its total size is about 1.6 million words (1613551 words). The corpus was prepared following the guidelines of the Corpus Encoding Standard and is Unicode-encoded.

Kaggle

A platform for sharing and discovering datasets, including Assamese speech datasets.

TC-11 Online Resources

Provides handwritten data from native Assamese writers.

FutureBeeAI

Provides 2000+ datasets for the AI development process.

Common Voice

Common Voice is a publicly accessible voice dataset. Its goal is to create an open-source, multilingual collection of voices that can be used by anyone to train speech-enabled applications.

ULCA

Universal Language Contribution APIs (ULCA) is an open-source, scalable data platform that supports a variety of datasets for Indic languages. It also provides a user-friendly interface for interacting with these datasets.



The resources available for Assamese

Resource                        | Resource Centre          | Specification
Text corpus                     | LDC-IL                   | 1,01,27,030 words
Text corpus                     | Sketch Engine            | 2.5 million words
Text corpus                     | Metatext                 | 7.6 million words
Text corpus                     | B2FIND                   | 1613551 words
Text corpus                     | AI4Bharat                | (not specified)
Parallel text corpus            | BHASHINI ULCA            | 1680416
Monolingual Chunked Text Corpus | TDIL                     | 392700 words
Synset                          | MeitY                    | Approx. 14,958 synsets
Speech corpus ASR               | LDC-IL                   | 54:21:12 hours
Speech corpus ASR               | AI4Bharat                | (not specified)
Speech corpus ASR               | EkStep-Unlabelled        | (not specified)
Speech corpus ASR               | EkStep-Labelled          | (not specified)
Speech corpus ASR               | Common Voice by Mozilla  | Not Available
Speech corpus ASR               | ACM Digital Library      | 20 hours
Speech corpus ASR               | Assamese Speech Data-ASR | 50:11:44 hours
Speech corpus TTS               | AI4Bharat                | 53:06:00 hours
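
The specification figures listed above mix Indian-style digit grouping (for example 1,01,27,030) with HH:MM:SS durations (for example 54:21:12), which can be awkward to compare directly. The following is a minimal sketch of how such values might be normalized. The function names `parse_indian_number` and `duration_to_seconds` are hypothetical helpers introduced here for illustration.

```python
def parse_indian_number(text):
    """Parse a digit string with comma grouping, e.g. '1,01,27,030'.

    Dropping the commas works for both Indian (lakh/crore) and
    international grouping, because only the digits carry value.
    """
    return int(text.replace(",", ""))

def duration_to_seconds(text):
    """Convert an 'HH:MM:SS' duration string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Values copied from the resource list above.
ldc_il_words = parse_indian_number("1,01,27,030")
ldc_il_asr_seconds = duration_to_seconds("54:21:12")

print(ldc_il_words)                          # 10127030
print(round(ldc_il_asr_seconds / 3600, 2))   # 54.35
```

Converting durations to seconds before summing avoids carry errors in the minutes and seconds fields when several speech corpora are combined.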

Conclusion

The development of language resources for Assamese is a collaborative effort involving various institutions and initiatives. These resources not only support the advancement of language technology but also preserve and promote the rich linguistic heritage of Assamese.