Corpora and Technological Resources for Assamese

Maintained by: Syeda Mustafiza Tamim
Last Updated on: 17/01/2025

Introduction

Language resources are essential for effective language study and analysis, yet developing high-quality data requires significant time and effort. Innovations such as writing systems, the printing press, and computers have made it easier to store language information. A comprehensive listing of authentic language data and tools benefits language studies: such resources are crucial for understanding how a particular language evolves and grows, for supporting research, and for improving language technology.

Resource Development Challenges

Language technology offers a profitable and socially beneficial way to overcome language barriers. However, digitizing language data poses challenges, particularly when developing accurate language models for complex languages like Assamese. The first step in enabling computers to process human language is encoding. This is achieved through the Unicode standard, which assigns a unique code point to each Assamese character.
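As a small illustration of the encoding step, the Python sketch below lists the Unicode code points of a few Assamese letters. Assamese characters live in the Unicode Bengali block (U+0980 to U+09FF), which includes the Assamese-specific letters ৰ (U+09F0) and ৱ (U+09F1). The helper names describe and in_bengali_block are made up for this example and do not come from any Assamese toolkit.

```python
import unicodedata

# Assamese is written in the Bengali-Assamese script, which Unicode
# encodes in the "Bengali" block, U+0980 to U+09FF.
BLOCK_START, BLOCK_END = 0x0980, 0x09FF


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def in_bengali_block(ch):
    """True if the character's code point falls inside the Bengali block."""
    return BLOCK_START <= ord(ch) <= BLOCK_END


# Sample built from escapes: ka, Assamese ra (U+09F0), Assamese wa (U+09F1).
sample = "\u0995\u09F0\u09F1"

for ch, code, name in describe(sample):
    print(ch, code, name)

# On disk, each of these code points takes three bytes in UTF-8.
print(len(sample.encode("utf-8")), "bytes in UTF-8")
```

The code point identifies the character, while UTF-8 is one common way to store that code point as bytes.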

Language Resources

Numerous materials exist for analyzing Assamese data and developing technologies. These resources include text and speech corpora, dictionaries, ontologies, and multimedia databases, alongside software for collection, preparation, and analysis. A linguistic corpus, which represents real-world language usage, is vital for developing various language technologies.

The main application areas of language technology include spell and grammar checking, speech recognition and synthesis, machine translation, and information retrieval. Language resources are foundational for these tools, enhancing communication and interaction between humans and computers.

Prominent Institutions and Initiatives

Linguistic Data Consortium for Indian Languages (LDC-IL)

LDC-IL, housed within the Central Institute of Indian Languages, has developed a comprehensive Assamese text and speech corpus from various sources, including books and newspapers. The data covers multiple domains, with significant contributions to language processing.

AI4Bharat

AI4Bharat, a research lab at IIT Madras, is committed to enhancing AI technology for Indian languages through open-source initiatives. The lab has developed and released an extensive set of datasets, tools, and cutting-edge models. Its focus areas are transliteration, natural language understanding and generation, translation, automatic speech recognition, and speech synthesis.

Indian Languages Corpora Initiative (ILCI)

The Indian Languages Corpora Initiative (ILCI), launched by TDIL, is a significant effort to create national corpora based on standardized frameworks. In Phase 1, the initiative developed parallel annotated corpora in 12 major Indian languages, including English, using India's national standards for part-of-speech (POS) annotation. Following Phase 2, the total size of the corpora is estimated at around 27 million parallel annotated and chunked words, covering key domains such as Health and Tourism (HT) and Agriculture and Entertainment (AGENT).

Technology Development for Indian Languages (TDIL)

Initiated by the Ministry of Electronics & Information Technology, TDIL focuses on developing multilingual knowledge resources and language technology. This includes balanced corpora and machine translation systems, benefiting numerous Indian languages, including Assamese.

OPUS

A growing collection of translated texts, aligned as a parallel corpus for easy access.

OpenSLR

Provides high-quality Assamese multi-speaker speech datasets, facilitating language research.

Gauhati University

The Assamese Corpus was developed in the NLP Lab of Gauhati University. It contains about 1.6 million words (1,613,551 words). The corpus was prepared following the guidelines of the Corpus Encoding Standard and is Unicode encoded.
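Corpus sizes such as the word count above depend on how a "word" is defined. The Python sketch below shows one simple convention for Unicode-encoded Assamese text: count every run of characters from the Unicode Bengali block. This is only an assumed approach for illustration; the counting method actually used for the Gauhati University corpus is not described here, and count_words is a made-up helper name.

```python
import re

# A "word" here is any run of characters from the Unicode Bengali block
# (U+0980 to U+09FF). The danda "।" (U+0964) sits outside that block, so
# it is not counted. Assamese digits are inside the block, so numbers
# written in them count as words under this simplification.
WORD_RE = re.compile(r"[\u0980-\u09FF]+")


def count_words(text):
    """Count runs of Bengali-block characters in the given text."""
    return len(WORD_RE.findall(text))


sample = "অসম এখন ৰাজ্য।"
print(count_words(sample))  # 3
```

A production tokenizer would also need to decide how to treat hyphenated forms, joiner characters, and mixed-script tokens, so counts from different tools may not match exactly.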

Kaggle

A platform for sharing and discovering datasets, including Assamese speech datasets.

TC-11 Online Resources

Provides handwritten data from native Assamese writers.

FutureBeeAI

FutureBeeAI provides more than 2,000 datasets for AI development.

Common Voice

Common Voice is a publicly accessible voice dataset. Its goal is to create an open-source, multilingual collection of voices that can be used by anyone to train speech-enabled applications.

ULCA

Universal Language Contribution APIs (ULCA) is an open-source, scalable data platform that supports a variety of datasets for Indic languages. It also provides a user-friendly interface for interacting with these datasets.

Resources Available for Assamese

Resource	Resource Centre	Specification	Source
Text corpus	LDC-IL	1,01,27,030 words	View Resource
	Sketch Engine	2.5 million words	View Resource
	Metatext	7.6 million words	View Resource
	b2find	1613551 words	View Resource
	AI4Bharat		View Resource
Parallel text corpus	BHASHINI ULCA	1680416	View Resource
Monolingual Chunked Text Corpus	TDIL	392700 words	View Resource
Synset	MeitY	Approx 14,958 synsets	View Resource
Speech corpus ASR	LDC-IL	54:21:12 hours	View Resource
	AI4Bharat		View Resource
	EkStep-Unlabelled		View Resource
	EkStep-Labelled		View Resource
	Common Voice by Mozilla	Not Available	View Resource
	ACM Digital Library	20 hours	View Resource
	Assamese Speech Data-ASR	50:11:44 hours	View Resource
Speech corpus TTS	AI4Bharat	53:06:00 hours	View Resource
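
The table mixes number formats: the LDC-IL word count uses Indian digit grouping (1,01,27,030), and speech durations are given as hours:minutes:seconds. The Python sketch below shows one way to normalize such figures before comparing resources. The helper names parse_count and duration_to_hours are assumptions made for this example.

```python
def parse_count(text):
    """Parse a count that may use Indian digit grouping, e.g. '1,01,27,030'."""
    return int(text.replace(",", ""))


def duration_to_hours(text):
    """Convert an 'HH:MM:SS' duration string into fractional hours."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60 + seconds / 3600


# LDC-IL text corpus size, shown with Western grouping for comparison.
ldc_il_words = parse_count("1,01,27,030")
print(f"{ldc_il_words:,} words")  # 10,127,030 words

# Speech corpus durations taken from the table above.
for label, duration in [
    ("LDC-IL ASR", "54:21:12"),
    ("Assamese Speech Data-ASR", "50:11:44"),
    ("AI4Bharat TTS", "53:06:00"),
]:
    print(f"{label}: {duration_to_hours(duration):.2f} hours")
```

Read this way, the LDC-IL text corpus is a little over 10 million words, roughly six times the 1,613,551 words listed for b2find.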

Conclusion

The development of language resources for Assamese is a collaborative effort involving various institutions and initiatives. These resources not only support the advancement of language technology but also preserve and promote the rich linguistic heritage of Assamese.