
भारतीय भाषाओं के लिए भाषाई डेटा कंसोर्टियम (एलडीसी-आईएल)
Linguistic Data Consortium for Indian Languages (LDC-IL)

शिक्षा मंत्रालय, भारत सरकार
Ministry of Education, Government of India

Bhashini | LDC-IL

Bhashini

Introduction

The Digital India Bhashini project has been set up under the aegis of the Ministry of Electronics and Information Technology (MeitY) as a mission mode project for making rapid progress in speech, text, and vision technology for Indian languages. To achieve these goals, Mission Bhashini has funded R&D projects proposed by premier institutions to collect data for machine translation, automatic speech recognition, text-to-speech, natural language understanding and optical character recognition. Funding has been provided to several research groups for building open training datasets and benchmarks for these tasks. These include a consortium led by IIT Madras to build speech technologies, a consortium led by IIITH, IITB and CDAC to build machine translation and a consortium led by IIITH to build OCR. In addition, a special wing named the Data Management Unit (DMU) has been set up at IIT Madras for collecting, creating and collating Indian language datasets. The aim of the DMU is to provide a base layer of data infrastructure across languages and tasks, while other projects in the mission will focus on building datasets and models for specific domains and language pairs. All R&D groups and the DMU will release all created datasets as open source under permissible licenses and will also upload them to the ULCA (Universal Language Contribution API) repository following open standards. It is expected that this will spur the development of AI models and applications for large-scale use across the nation.

To ensure that data collection at this scale happens in a standardized manner with appropriate guidelines and quality metrics, the DMU has been tasked with defining the specifications and processes for all tasks. More specifically, for all manual data which will be collected as a part of Mission Bhashini, this document will define (i) the processes to be followed while creating data, (ii) the annotation guidelines to be used, (iii) the specifications to be adhered to, (iv) the quality metrics and quality assurance procedures to be followed and (v) the policies and permissible licenses to be considered so that the data can be used without any restrictions by all stakeholders in the language technology ecosystem.

Background

To set the context for this report, we first briefly describe the tasks of interest as well as the types of data being collected under Digital India Bhashini Mission.

Tasks of Interest

The Mission will focus on developing datasets, benchmarks, models and tools for the following 5 tasks. The languages of interest include all the 22 constitutionally recognized languages of India.

  • Machine Translation (MT): Enable automatic translation of text segments between any of the 22 languages.
  • Natural Language Understanding (NLU): Enable automatic named entity recognition, sentiment analysis and question answering in all the 22 languages.
  • Automatic Speech Recognition (ASR): Enable automatic transcription of audio/video content in all the 22 languages.
  • Text-To-Speech Conversion (TTS): Enable automatic synthesis of audio from a given piece of text in all the 22 languages.
  • Optical Character Recognition (OCR): Enable automatic recognition of text present in documents and scene images in all the 22 languages.

Types of Data

Existing works have shown the utility of different types of data for building language technology. We begin by listing these types below:

Manual:

This category of data is of the highest quality and involves human effort in creation without dependence on any existing models. In our classification, we consider data to be of the Manual type if it is created by language experts with strict adherence to guidelines on data collection. Examples include translating sentences from scratch or transcribing audio files. All benchmark quality data is manual.

Post-editing:

This category of data includes manually labeled data but with inputs from existing models. Examples include translating text by editing the output of a translation model such as IndicTrans. Such data may have biases introduced by the model that is used for generating the translation but comes with the benefit of faster human effort.

Crowdsourced:

This category of data is collected through crowdsourcing platforms, including the Bhashadaan platform. While humans are involved in the data collection process, given the distributed and largely unsupervised nature of the activity, there is no assurance of adherence to processes and quality criteria. However, such data is still useful in training models.

Machine discovered:

This category includes different types of data generated by models entirely with no human effort. The model used could be for different purposes. For instance, transcription models can be used to align chunks of audio data with transcripts available in document form or language encoders can be used to align bitext pairs across large monolingual corpora. Under high thresholds of alignment scores, such machine generated data can be of high quality.

Synthetic data:

Another type of machine generated data is data created directly by models. For instance, a parallel corpus can be created by a translation model such as IndicTrans. Such data is not considered of high quality given that the data is representative of the model that is used to create it. Such data can also be created algorithmically using templates (e.g., images of Curriculum Vitae (CV) created using standard CV templates).

Unsupervised:

The final category of dataset includes unsupervised data typically useful for pretraining models and for mining aligned pairs. For instance, for speech transcription, unsupervised audio data is used to pretrain models such as wav2vec. Similarly, for mining bitext pairs large monolingual corpora are used.

Task                              Unit        Training     Benchmark
Machine Translation               Sentences   100,000      5,000
Automatic Speech Recognition      Hours       500-1,000    50
Text-To-Speech Conversion         Hours       40           -
Optical Character Recognition     Images      20,000       20,000
NLU-Sentiment Analysis            Sentences   10,000       10,000
NLU-Named Entity Recognition      Sentences   10,000       10,000
NLU-Question Answering            Questions   10,000       10,000

Machine Translation

To train and evaluate Machine Translation systems for a given language pair we need parallel sentences for this language pair. Below we define the procedure for collecting such data with emphasis on specifications, annotation guidelines, workflow and quality assurance.

Machine Translation Specifications

Parallel data collected for training and evaluating MT systems should meet the following specifications:

Diversity in domains: The parallel sentences should cover the following 13 domains: Legal, Governance, History, Geography, Tourism, STEM, Religion, Business, Sports, Health, Entertainment, Culture, and News. Of the total number of source sentences, at least 6-10% should come from each of these domains.

Diversity in lengths: For every domain of interest, it is desired that the following ranges of segment lengths have a good representation: 1-5 words, 6-10 words, 11-17 words, 18-25 words, > 25 words. At least 15% of the source sentences should be in each of the following buckets: 11-17, 18-25, > 25 words. Further, not more than 10% of the data should be less than 10 words. The average length of the English sentences in the entire data should be at least 15 words. For IL-IL parallel corpora, length distribution of source sentences would be worked out based on average length of sentences in the source language and be approved by the Expert Committee. In other words, the lengths shown for English (10 word, 15 word, etc.) would be suitably scaled while building such IL-IL corpora.
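
The length specification above can be checked mechanically on a batch of English source sentences. The following is a minimal sketch, assuming whitespace tokenization as the word count (an assumption; the report does not define how words are counted). The names `bucket_of` and `length_report` are illustrative and not part of any Bhashini tool.

```python
from collections import Counter

# Bucket ranges as listed in the specification; names are illustrative.
BUCKETS = [("1-5", 1, 5), ("6-10", 6, 10), ("11-17", 11, 17),
           ("18-25", 18, 25), (">25", 26, None)]

def bucket_of(sentence):
    """Return the length bucket of a sentence (whitespace word count)."""
    n = len(sentence.split())
    for name, lo, hi in BUCKETS:
        if n >= lo and (hi is None or n <= hi):
            return name
    return None  # empty sentence

def length_report(sentences):
    """Return per-bucket shares and pass/fail flags for the length criteria."""
    if not sentences:
        raise ValueError("no sentences given")
    total = len(sentences)
    counts = Counter(bucket_of(s) for s in sentences)
    shares = {name: counts.get(name, 0) / total for name, _, _ in BUCKETS}
    lengths = [len(s.split()) for s in sentences]
    checks = {
        # At least 15% in each of the three longer buckets.
        "long_buckets_ok": all(shares[b] >= 0.15 for b in ("11-17", "18-25", ">25")),
        # Not more than 10% of sentences shorter than 10 words.
        "short_ok": sum(1 for n in lengths if n < 10) / total <= 0.10,
        # Average length of at least 15 words.
        "mean_ok": sum(lengths) / total >= 15,
    }
    return shares, checks
```

For IL-IL corpora the thresholds would need to be rescaled as described above, so the constants here apply to English source sentences only.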

N-way parallel: The data should have n-way parallel sentences, i.e., a large fraction of the data should contain the same sentences translated to all the 22 constitutionally recognized languages. This would allow building and evaluating Indic-Indic translation systems at much lower costs. To create such n-way parallel data, different languages could be used as pivots (e.g., English, Tamil, Hindi, etc). The information about the pivot language should be clearly stored in the meta-data. (Additional guidelines would be drawn for IL-IL parallel corpora with and without pivots.)

Source original: For each language, the data should contain some sentences which were originally written in that language and then translated to English or any other Indian language.

Discourse level translations: Instead of collecting translations of isolated sentences, it is preferred to translate entire paragraphs or a collection of 3-5 contiguous sentences so that the data can also be used for training/evaluating discourse level translation models. DMU recommends that at least 20% of the data should contain discourse level translations.

Downstream applicability: While collecting training data from a variety of domains would be useful, one should also focus on collecting data for building practical applications, such as translation for everyday usage/conversations. DMU recommends that at least 25% of the data should contain sentences from everyday conversations.

Unbiased to specific NMT systems: Given that good translation systems are already available for several Indian languages, it is unavoidable that translators would rely on certain publicly available MT systems to get an initial partially correct translation and then edit it. It is important to sensitize the translators and have in-built mechanisms in data collection tools so that (i) while creating training data, a translator is shown the output of several MT systems instead of just one and (ii) while creating benchmark data, a translator is not allowed to use the output of any MT system but is instead forced to translate from scratch. Of course, there should not be any overlap between training and test data. In particular, there should be no overlap across languages.

Compliance to formats: The collected data should be in UTF-8 encoding and be ULCA compliant. In addition the data should not contain any objectionable content.

Machine Translation Guidelines

Below we describe the guidelines to be used while translating sentences from the source language. These guidelines are partly inspired by similar guidelines prepared by LDC for the BOLT Chinese-English translation task.

General Principles

The translation in the target language must be faithful to the text in the source language in terms of both meaning and style. The translation should mirror the original meaning as much as possible while preserving grammaticality, fluency, and naturalness.

To the extent possible, the translation should have the same speaking style, tone or register as the source. For example, if the source is polite, the translation should maintain the same level of politeness. If the source is rude, excited, or angry, the translation should convey the same tone.

The translation should contain the exact meaning conveyed in the source text and should neither add nor delete information. For instance, if the original text uses Modi to refer to Honorable Prime Minister Narendra Modi, the translation should not be rendered as Prime Minister Modi, Narendra Modi, etc. No bracketed words, phrases or other annotation should be added to the translation as an explanation or aid to understanding.

All sentences should be spell checked and reviewed for typographical errors before submission.

While writing sentences in Indian languages, the official native script of the language should be used as mentioned below:

  • Bangla script for Bengali, Assamese
  • Devanagari script for Bodo, Dogri, Hindi, Konkani, Maithili, Marathi, Nepali, Sanskrit, Sindhi
  • Gujarati script for Gujarati
  • Gurumukhi script for Punjabi
  • Kannada script for Kannada
  • Malayalam script for Malayalam
  • Meitei Mayek script for Manipuri
  • Odia script for Odia
  • Ol Chiki script for Santali
  • Perso-Arabic script for Kashmiri, Urdu
  • Tamil script for Tamil
  • Telugu script for Telugu

Named Entities

Named entities in English which have a well accepted conventional translation in the regional language should be translated using this conventional translation. For example, Indian Institute of Technology would be translated as “भारतीय प्रौद्योगिकी संस्थान” in Hindi.

If a well accepted conventional translation of the English named entity does not exist in the target language, then the named entity should be transliterated. For example, “Pope Francis” should be translated as “पोप फ्रान्सिस” in Hindi.

In all cases, avoid inventing translations of named entities in the target language if they do not exist already. Use transliteration instead.

The above rules are language specific, and it is possible that an English named entity gets translated in one Indian language and transliterated in another. The key deciding factor is the presence or absence of a well accepted conventional translation of that named entity in that Indian language.

Code Mixing and Borrowing

In everyday usage it is common to use code-mixing (e.g., in spoken conversations, English terms are commonly mixed with the native language by many native speakers in India). Such code-mixing is acceptable while translating informal content, such as everyday conversations and voice commands. However, such code-mixing should be avoided while translating formal content.

Similarly, many words have been borrowed from English into Indian languages and are now nativised (e.g., train, computer, internet, etc). The translators may use such code mixed and borrowed terminology in the translation if it is a more well accepted term in the target language than a pure translation in the target language (e.g., we refer to “संगणक” as the pure translation of computer as opposed to “कम्प्यूटर” which is also well accepted). However, this may not be imposed very strictly as some variety in the corpus is also desired (e.g., some sentences which use the pure translation “संगणक” and some which use the borrowed word “कम्प्यूटर”). Note that this is applicable to both formal as well as informal content.

Factual errors in the source sentence should be retained as is. For example, if the source sentence says “Ranveer Singh and Alia Bhatt starrer Brahmastra will release in theaters today” then the translation should also contain this factual error and not correct it to Ranbir Kapoor. Such factual errors may also be present in descriptions of historical events (e.g., dates may be incorrect or alternative versions of events may exist). Similarly, such factual errors may also be present in scientific theories which are disputed/controversial. In all such cases, the translators should produce a translation which is faithful to the given source sentence.

Spelling mistakes in the source sentence should be corrected. If the source sentence has severe grammatical errors then it should be discarded. If the source sentence belongs to formal content and has minor grammatical errors then such errors should be corrected. If the source sentence belongs to informal content (e.g., everyday conversations) and contains minor grammatical errors then such errors may be retained as is, since they reflect everyday usage of the language.

Numbers and Units

Numbers in the translation should either be spelled out in full or written as digits, according to how they appear in the source text.

It is acceptable to use English numerals instead of their equivalents in the regional language. However, we leave this choice to the language experts with the understanding that this choice should be consistent across sentences (i.e., either use English digits in all sentences or regional digits in all sentences).

Roman numerals in English should be retained as is in the target language.

DMU Report, Digital India Bhashini Mission August 202

Big numbers (up to millions) should be translated using the conventions of the target language. For example, 700 million should be translated as 70 करोड़ as opposed to 700 मिलियन. Very large numbers, such as billion and trillion, could be translated as is (e.g., 7 billion should be translated as 7 बिलियन). However, alternative translations containing regularly used terms in the target language are also acceptable if they are popular and well accepted in the target language. This would ensure some diversity in the corpus.
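
The arithmetic behind the 700 million example (1 crore = 10,000,000 and 1 lakh = 100,000) can be sketched as below. This is only an illustration for Hindi: the function name `to_indian_units` is hypothetical, and the choice to leave values of a billion and above untouched simply mirrors the guideline rather than any prescribed tool behaviour.

```python
CRORE = 10_000_000
LAKH = 100_000

def to_indian_units(value):
    """Render an integer below one billion using lakh/crore (Hindi words)."""
    if value >= 1_000_000_000:
        # The guideline allows billion/trillion to be carried over as is.
        raise ValueError("values of a billion and above are not converted here")
    if value >= CRORE:
        return f"{value / CRORE:g} करोड़"
    if value >= LAKH:
        return f"{value / LAKH:g} लाख"
    return str(value)

print(to_indian_units(700_000_000))  # 700 million
```

The `:g` format drops a trailing `.0`, so whole quantities print without a decimal point.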

For units of measurement that may differ between English and Indian languages (for example "miles" v/s "kilometers" or “gallons” v/s “liters”), the translators should produce a translation which retains the units as mentioned in the source sentence. For example, “3 miles” should be translated as “3 मील” and not “4.8 किलोमीटर” (even though kilometer is a more popular/acceptable unit in India).

Dates

Dates in the translation should either be spelled out or written as digits, according to how they appear in the source text. For example, 17 January 2022, would be translated as “17 जनवरी 2022” and not “17-01-2022”.

Dates written in numeric format (mm-dd-yyyy, dd-mm-yyyy, dd-mm-yy, etc) should be translated as they occur in the source sentence. For example, the English date “01-09-2022” should simply be translated as “01-09-2022” in Hindi.

The year should be translated using 4 digits or 2 digits depending on how it appears in the source sentence. For example, “01-09-21” should be translated as “01-09-21” in Hindi and not as “01-09-2021” (even though the latter translation has no ambiguity).

Technical terms

For translating technical terms, translators should refer to the class 1 to class 12 books in the native language provided by NCERT, NIOS or state boards. The translators should also refer to the translation dictionaries prepared by the Commission for Scientific and Technical Terminology (CSTT) and TDIL for different domains (Science, Engineering/Technology, Medical Science, Humanities, Social Sciences, Agricultural Science, Veterinary Science).

If a technical term does not have a native translation in NCERT, NIOS or other state board textbooks or in CSTT dictionaries, or if the translation in the CSTT dictionary is too archaic/academic, then it should be transliterated into the target language. Note that in many languages, such as Sanskrit and Santali, which have limited Western influence, a large number of terms in English will have to be transliterated. For example, terms like “penalty shootout” do not have a well accepted native translation in Sanskrit. While a Sanskrit term for “penalty shootout” can be coined, it may not sound natural in the context of the sport. Hence, it is acceptable to transliterate such terms.

Acronyms in Roman script should be mapped character by character, with periods between mapped characters on the Indian language side. For example, if the English sentence contains CSTT, the Hindi side equivalent would be सी. एस. टी. टी.
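
The character-by-character mapping for acronyms can be sketched as a simple table lookup. In the table below only the entries for C, S and T are confirmed by the example above; the remaining Hindi letter names are common renderings supplied here as an assumption, and `map_acronym` is an illustrative name.

```python
# Hindi names of Roman letters. Only C, S and T come from the example in the
# guideline; the rest are assumed common renderings and should be reviewed
# by language experts before use.
LETTER_NAMES_HI = {
    "A": "ए", "B": "बी", "C": "सी", "D": "डी", "E": "ई", "F": "एफ",
    "G": "जी", "H": "एच", "I": "आई", "J": "जे", "K": "के", "L": "एल",
    "M": "एम", "N": "एन", "O": "ओ", "P": "पी", "Q": "क्यू", "R": "आर",
    "S": "एस", "T": "टी", "U": "यू", "V": "वी", "W": "डब्ल्यू",
    "X": "एक्स", "Y": "वाई", "Z": "जेड",
}

def map_acronym(acronym, names=LETTER_NAMES_HI):
    """Map each Roman letter to its name, each followed by a period."""
    # Raises KeyError for characters that are not Roman letters.
    return " ".join(names[ch] + "." for ch in acronym.upper())

print(map_acronym("CSTT"))
```

A table for another target language would follow the same shape, with the letter names written in that language's script.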

It is also recommended that in cases where a technical term is translated, the transliteration can also be retained in the bracket with the following tag . For example “मुझे एक संगणक (कम्प्यूटर) चाहिए”. This will give downstream users more flexibility in using the desired term in the target language.

Machine Translation Collection Workflow

The following workflow is recommended for creating parallel sentences:

Curation of source sentences: The source sentences (i.e., sentences in the source language) should be carefully created to meet the diversity criteria in the specifications mentioned above. It should be ensured that the source sentences are unique, i.e., (i) there are no duplicates across different RnD projects and DMU and (ii) there are no duplicates in existing parallel sources of data. To do so, all projects should upload their source sentences in a common repository before starting the translation process. This common repository will be set up by the DMU.
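
One way a common repository could detect duplicate source sentences is to compare normalized hashes rather than raw strings. The sketch below assumes Unicode NFC normalization, whitespace collapsing and case folding as the notion of "same sentence"; the report does not define this, and the function names are hypothetical.

```python
import hashlib
import unicodedata

def sentence_key(sentence):
    """Hash of a sentence after light normalization."""
    norm = unicodedata.normalize("NFC", sentence)
    norm = " ".join(norm.split()).casefold()  # collapse whitespace, ignore case
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def find_new_sentences(candidates, repository_keys):
    """Return candidates not already in the repository or repeated in the batch."""
    seen = set(repository_keys)
    fresh = []
    for sentence in candidates:
        key = sentence_key(sentence)
        if key not in seen:
            seen.add(key)
            fresh.append(sentence)
    return fresh
```

Sharing only the hashes would let projects check for overlap before uploading the sentences themselves.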

Verification of source sentences: Source sentences should be verified to ensure that there are no spelling errors, grammatical errors, factual errors or improper encodings. The source sentences may contain punctuation, quotes and content in brackets, which should be retained as is. During the verification stage, it should also be ensured that the source sentences meet the specifications described earlier.

Translation of source sentences: The translations should be done using the guidelines described earlier. To facilitate this, the tools used for translation should ensure the following:

  • While creating benchmark data, translators do not take the help of any MT systems.
  • Translators have access to standard online dictionaries (e.g., Shoonya integrates dictionaries prepared by CSTT).
  • The tool allows translators to add notes and/or skip difficult sentences for discussion later.
  • The tool allows translators to see translations in other languages (e.g., while translating from English to Kashmiri, the translator may want to see the Urdu and Hindi translations).
  • If the translator relies on the output of an MT system, the tool should allow the translator to log this information.

Verification of translations: Each translation should be verified by a different translator (preferably, a senior translator). The verifier should follow the same guidelines as listed above and wherever needed, edit the translation to make it perfect.

Machine Translation Quality Assurance

It is important that the data created by the above workflow is independently certified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of certifiers: It is important that the certifiers assigned for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of translators. DMU recommends that only translators with at least 15 years of experience should be considered for this task.

Verification Tool: DMU will build a certification interface in the open source tool Shoonya and host it on the Bhashini platform (e.g., https://bhashini.gov.in/shoonya). This interface will allow human certifiers to see a translation pair and check for its correctness.

Creation of verification task: DMU recommends that after every 10,000 sentences are translated by a group (DMU or RnD projects), these sentences should be uploaded on the above verification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tools will show the entire data (100%) to the human certifiers.
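
The sampling rule above (10% of a training batch, 100% of a benchmark batch) can be sketched as follows. The function name, the `data_type` labels and the optional `seed` parameter are illustrative assumptions, not the interface of the Shoonya tool.

```python
import random

def sample_for_certification(batch, data_type, seed=None):
    """Select the sentence pairs to show to human certifiers."""
    items = list(batch)
    if data_type == "benchmark":
        return items  # benchmark data is verified in full
    if data_type == "training":
        if not items:
            return []
        k = max(1, round(0.10 * len(items)))  # 10% random sample
        return random.Random(seed).sample(items, k)
    raise ValueError("data_type must be 'training' or 'benchmark'")
```

Leaving `seed` unset gives a different sample on each call, so a reworked batch is certified on a fresh sample.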

Verification: For each translated sentence pair, a human certifier is required to rank the sentence pair on quality. DMU recommends ranking sentence pairs based on established criteria such as the 0–4 scale used in many MT evaluations. Given certain linguistic variations and preferences, sentence pairs scored at 3 and above can be considered to be of good quality.

Quality metric: Once the sentences are ranked as per the scoring scale, the tool will compute the percentage of sentence pairs which pass the verification.
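
The quality metric amounts to a simple percentage over the certifier scores. A minimal sketch, assuming scores on the 0-4 scale and the pass threshold of 3 mentioned above (`pass_rate` is an illustrative name):

```python
def pass_rate(scores, threshold=3):
    """Percentage of sentence pairs scored at or above the threshold."""
    if not scores:
        raise ValueError("no scores given")
    passed = sum(1 for s in scores if s >= threshold)
    return 100.0 * passed / len(scores)

print(pass_rate([4, 3, 2, 3, 1]))  # three of five pairs pass
```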

Acceptance criteria: The uploaded batch of sentences will be accepted only if the percentage of words that were edited is less than 10% per batch.
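
The report does not define how "words that were edited" is counted. One possible reading, sketched below, aligns each submitted translation with the certifier's corrected version at word level using Python's `difflib.SequenceMatcher` and counts the words in every non-matching span. Both this definition and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def edited_word_percentage(pairs):
    """pairs: iterable of (submitted, corrected) translations for one batch."""
    edited = total = 0
    for submitted, corrected in pairs:
        a, b = submitted.split(), corrected.split()
        total += len(a)
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Count the longer side of each replaced/inserted/deleted span.
                edited += max(i2 - i1, j2 - j1)
    if total == 0:
        raise ValueError("batch contains no words")
    return 100.0 * edited / total

def batch_accepted(pairs, limit=10.0):
    """Accept only if strictly less than `limit` percent of words were edited."""
    return edited_word_percentage(pairs) < limit
```

Under this reading a batch with exactly 10% edited words is rejected, since the criterion asks for less than 10%.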

Rework: In case a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

Apart from certifying the quality of individual batches, at the end of data collection the certifier should also certify if the data meets the specifications mentioned earlier for (i) domain distribution, (ii) length distribution, and (iii) UTF-8 encoding.

Machine Translation Licensing Considerations

It is important that all data collected, either source data (such as source sentences) or derived data (such as translations), have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Permissible licenses of source sentences: While curating sentences in the source language for translation, it should be ensured that such sentences are curated only from sources which come under permissible licenses such as the CC BY 4.0 license (e.g., Wikipedia) or from sources which explicitly grant consent for free usage for all purposes.

Consent of content creators, translators and/or translation agencies: We expect that most translations and verifications will be done by in-house translators or external data collection agencies. Further, while most source sentences may be derived from permissible online sources, some source sentences may be explicitly created with the help of content creators (e.g., everyday conversations). Explicit consent of content creators, in-house translators and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data freely for all purposes.

Automatic Speech Recognition

To train and evaluate Automatic Speech Recognition systems for a given language we need audio recordings paired with their transcripts. Below we define the procedure for collecting such data with emphasis on specifications, annotation guidelines, workflow and quality assurance.

ASR Specifications

To train and evaluate ASR systems we need audio files and their captions. Such parallel data should have the following characteristics:

Diversity in collection method: The data should contain a mix of (i) audio samples collected from speakers on the ground and (ii) audio samples taken from existing content such as news, educational videos and entertainment videos provided such content is available with permissible licenses. At least 25% of the data should belong to each of these categories accounting for a total of 50% of the data. The respective groups have freedom in selecting the collection method for the remaining 50% of the data (for some languages like Hindi it may be easier to find existing content whereas for some languages like Santali it may be easier to collect data on the field).

Diversity in speakers: For every language, the audio data collected on the ground should come from a wide variety of speakers having different accents (e.g., Surat v/s Vadodara), different ages (18-30, 30-45, 45-60, >60), different educational backgrounds (school level, graduate, post-graduate) and different genders. DMU recommends the following specifications:

  • Gender diversity: 45-55% male speakers, 45-55% female speakers.
  • Age diversity: At least 15% of speakers each from the age groups of 18-30 years, 30-60 years and 60+ years.
  • Location diversity: At least 60% of the districts for a given language should be covered. For languages like Hindi, which are spoken in more than 100 districts, at least 50 districts should be covered.
  • Urban/Rural diversity: To the extent possible, the percentages of speakers from urban and rural areas for a given language should closely match the percentages mentioned in the census for that language.
  • Education diversity: At least 10% of speakers should belong to each of the following categories: no schooling, up to 12th standard, graduate, post-graduate.

The amount of data collected from a single user should ideally not exceed 30 minutes.
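
The gender, age, education and per-speaker duration recommendations above can be checked from speaker metadata. The sketch below assumes each speaker is a dictionary with hypothetical keys `gender`, `age`, `education` and `minutes`. Because the stated age ranges share endpoints, the boundary handling in `age_group` is also an assumption. Location and urban/rural checks are omitted since they need census data.

```python
from collections import Counter

EDUCATION_LEVELS = ("no schooling", "up to 12th", "graduate", "post-graduate")

def age_group(age):
    # Assumed boundaries: 30 falls in the first group, 60 in the second.
    if age <= 30:
        return "18-30"
    if age <= 60:
        return "30-60"
    return "60+"

def shares(values):
    """Fraction of the list taken by each distinct value."""
    counts = Counter(values)
    return {key: n / len(values) for key, n in counts.items()}

def check_speaker_diversity(speakers):
    """Return a pass/fail flag for each recommendation."""
    gender = shares([s["gender"] for s in speakers])
    age = shares([age_group(s["age"]) for s in speakers])
    education = shares([s["education"] for s in speakers])
    return {
        "gender": all(0.45 <= gender.get(g, 0.0) <= 0.55 for g in ("male", "female")),
        "age": all(age.get(g, 0.0) >= 0.15 for g in ("18-30", "30-60", "60+")),
        "education": all(education.get(e, 0.0) >= 0.10 for e in EDUCATION_LEVELS),
        "minutes": all(s["minutes"] <= 30 for s in speakers),
    }
```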

Diversity in vocabulary: The audio should contain words from a wide variety of domains. This is applicable for read speech where the sentences should come from the following 13 domains: Legal, Governance, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. Each domain should have at least 5% representation in the data.

Size of vocabulary: The size of the vocabulary as computed from the transcripts of all the collected data should be at least 50K words per language.

Diversity in channels: The data can be collected over wideband channels (e.g., on-device recorders or apps on mobile phones) or narrowband channels (e.g., telephony-based voice calls). It is strongly recommended that 65-75% of the data is from wideband and 25-35% is from narrowband channels. In the discussion below, we refer to wideband as WB and narrowband as TEL.

Diversity in content: While collecting data from the ground, it is recommended that at least 50% of the speakers contribute at least 2 of the following types of data:

  • Read speech: reading sentences from diverse domains (WB)
  • Voice commands from the domain of digital payments and governance (WB)
  • Voice commands from the domain of e-commerce payments (TEL)
  • An extempore get-to-know-me interview (hobbies, favorite story, etc) (WB)
  • An extempore role-playing conversation (e.g., banker-customer) (TEL)
  • An extempore conversation on a topic of interest (e.g., yoga, government schemes, etc.) (WB)

For each type of content, we have also specified the preferred channel over which the data should be collected. It would be too restrictive to specify the exact distribution of each of the above for a given speaker as it depends on the ability of the speaker. The only recommendation is to ensure some diversity in the content being collected from each speaker. Despite best efforts, some speakers may only be able to contribute one type of data (e.g., some speakers may not be comfortable in being extempore, some speakers may not be able to read the given sentences and so on).

Diversity in recording devices: While collecting data from the ground, it is recommended that speakers be allowed to use their own mobile phones for recording. For WB data, the recordings can be done using on-device recorders or apps. For TEL data, the recording can be done by setting a bridge and allowing the participant(s) to join this bridge by making a telephone call. This will ensure diversity in the recording devices and channels used for collecting audio data. Given the large number of speakers involved in the collection process, there will be a natural diversity in recording devices and hence no additional specifications are mentioned here. Further, given that the telephone calls will be made using different devices, there will be diversity in the cellular technology being used (2G, 3G, 4G, etc).

Diversity in noise: At least 10% and not more than 20% of the data should be collected in a noisy environment with natural background noises such as fans, vehicles, people, etc. The SNR for such noisy data should be around 10 dB.
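
For reference, the signal-to-noise ratio in decibels is 10 times the base-10 logarithm of the ratio of signal power to noise power. The sketch below assumes that a speech segment and a separate noise-only segment are available as lists of samples. How these segments are obtained from a recording is not specified in the report and is left open here.

```python
import math

def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """SNR in dB from a speech segment and a noise-only segment."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))
```

A power ratio of 10 between speech and background corresponds to the 10 dB target mentioned above.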

Diversity in genres: While collecting audio samples from existing content, it is recommended that at least 2 of the following 4 genres are covered:

  • News: This will be primarily sourced from news channels and can be further categorized into the following types:
      • Headlines: This is content of the type “Top 20 headlines of the hour” which does not have high speaker diversity but has peculiar characteristics like jarring background music.
      • On-field reporting: This is content of the type “cameraman Prakash ke saath….” which is extempore, has background noise and involves common people on the ground.
      • Debates: Such content would have diversity in content (government policies, banning an outfit, etc.) and will also have peculiar characteristics like emotional outbursts, overlapping chatter, etc.
      • Interviews: Such content would involve a news anchor and 1-2 experts and caters to a variety of topics. The experts do not follow a script, so the content has the flavor of natural speech.
      • Special reports: Such content involves people on the ground and has good vocabulary spanning multiple domains.
  • Entertainment: This will be primarily sourced from entertainment channels and would include content from different genres: family shows, comedy shows, crime shows, reality shows, cooking shows, travel shows, songs.
  • Education: This will be primarily sourced from education channels and would contain content from STEM, Health and How-to videos.
  • Call center: This will be primarily sourced from call centers catering to one or more of the following domains: agriculture, legal, banking, insurance, health. It should be ensured that the data is suitably anonymised and contains no personally identifiable information (PII).

It would again be too restrictive to specify the exact distribution for each of the above categories as it depends on the availability of such content with permissible licenses. For example, call center data for most languages may be hard to procure due to privacy issues.

Compliance to formats: The collected data should meet the following format specifications: (i) all audio files should be in .wav format with a sampling rate of at least 16 kHz (ii) all transcripts should be in UTF-8 encoding (iii) all transcribed data should be ULCA compliant. In addition the data should not contain any objectionable content.

While the above specifications are recommended, it is also recommended that there should be a provision for the PIs to make a case for changing these specifications in response to the prevailing conditions. For example, if no existing content is available for some language with permissible licenses then the proportion of such data would have to be reduced to 0. Such concessions should also be made if it would take a long time to procure such existing content or if the procured content is not of the desired quality (e.g., if an existing media channel provides 100 hours of content in Santali but all of it is Math-Education from a single speaker, then there is not much utility in collecting/labeling such content). Similarly, for some languages there might be logistical issues in meeting the specified distribution of speakers (e.g., speakers with different education levels). Such concessions can be made on a case-by-case basis by the Project Review Committee.
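The noise and format specifications above lend themselves to simple automated checks. The following is a minimal sketch, not part of the specification: the function names are hypothetical, the SNR estimate assumes that a noise-only segment of the recording is available, and ULCA compliance is not checked here.

```python
import math
import wave


def mean_power(samples):
    """Mean squared amplitude of a non-empty sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """SNR in dB from speech samples and a noise-only segment (assumed available)."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise segment is silent; SNR is undefined")
    return 10.0 * math.log10(mean_power(speech) / noise_power)


def wav_problems(path, min_rate=16000):
    """Return a list of format problems for one audio file (empty if none)."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM .wav file: {exc}"]
    if rate < min_rate:
        return [f"sampling rate {rate} Hz is below {min_rate} Hz"]
    return []


def is_utf8(path):
    """True if the transcript file decodes cleanly as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A collection tool could run `wav_problems` and `is_utf8` on every uploaded file and use `snr_db` to confirm that recordings marked as noisy are close to the 10 dB target.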

ASR Guidelines

Collecting Voice Data from Native Speakers:

We now list the guidelines for collecting voice data from native speakers at a temporary location. These guidelines are taken (almost verbatim) from the guidelines prepared by NIST for the task of speaker recognition.

Collection Environment:

The collection environment should

Be an indoor space as free as possible from background noises such as air conditioners, generators, fans, or other motorized or electrical devices. Avoid locations that have music, white noise, or other audio playing in the background at any audio level. It may be necessary to turn the interference source off to fully mitigate it.

Ideally, be a location that is not near outside traffic noise (human, animal, vehicular, or aircraft) but a small fraction of the data (not exceeding 20% of the total data) can be collected in such environments.

Have a minimum of large, flat, hard sound-reflective surfaces which can cause reverberation and echoes. The effects of a reverberant room can be mitigated by hanging fabric (curtains, blankets, etc.) or other sound deadening materials on the walls or as dividers in the room.

Allow the participant to be as comfortable as possible, preferably sitting, to lower cognitive/voice stress levels and to facilitate natural conversation.

Collection Equipment: Although high-quality digital audio recording equipment is preferred, recordings made with equipment meeting the minimum requirements detailed below can be used. In particular, it is recommended that all recordings should be done using the mobile phones of participants as this would be closer to the real-world scenarios in which ASR systems would be deployed. Allowing each participant to use their own device will also ensure that there is enough variety in the input devices that are used (e.g., on-device mics of different phones, headsets of different brands, etc.). Nothing in these requirements precludes the concurrent use of multiple recording devices if that is required for the intended application. An example of such a requirement would be to create a pair of recordings in which one is a high-quality reference while the other is condition-matched to a specific use case. Recording devices should fulfill the following requirements:

The speech must be recorded digitally and saved as uncompressed PCM data with at least 16-bit samples at a minimum rate of 16,000 Hz for wide band and 8000 Hz for narrow band. The audio can be mono or stereo.

The audio should be saved in a standard lossless file format such as PCM-WAV, or be in a file which can be converted to a standard format without loss of fidelity. The audio should not be saved in a file format such as MP3 or WMA which use a lossy codec to compress the audio data. Any type of automatic gain control (AGC) on the microphone or recorder should be turned off/disabled during the recording session.

For recordings made using a laptop or other computer, it is preferred to use an external USB condenser microphone with an on-board analog-to-digital (A/D) converter. This is because the internal microphone or external microphones plugged into a “mic” port can pick up noise from internal circuitry.

The subject’s microphone should ideally be a headset mic or the in-built mic of the subject’s phone. This will ensure there is enough diversity in the headsets and phones used for recording.

The interviewer should have some indicator available on the recording device that shows that the audio is being recorded at an appropriate amplitude level and not too low (resulting in a noisy recording due to quantization effects) or too high (which causes clipping and thereby introduces nonlinear distortion into the audio stream).

There should be some method available to back up the collected data, such as writing it to external hard drives, USB thumb drives, or online storage repositories.

For recording telephony data (e.g., extempore role-playing conversations) a bridge may be set up and the participant(s) can dial into the bridge using their mobile phones to have a telephonic conversation.
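The amplitude-level requirement above (neither too low nor clipped) could be checked on the recorded samples after the fact. The sketch below assumes signed 16-bit PCM samples; the function names and the thresholds (-30 dBFS for "too low", 0.1% clipped samples for "too high") are illustrative assumptions and are not values given in these guidelines.

```python
import math

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def peak_dbfs(samples):
    """Peak level in dB relative to 16-bit full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20.0 * math.log10(peak / FULL_SCALE)


def clipped_fraction(samples, margin=1):
    """Fraction of samples at (or within `margin` of) full scale."""
    limit = FULL_SCALE - margin
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)


def level_status(samples, low_dbfs=-30.0, max_clipped=0.001):
    """Classify a recording as 'too high', 'too low' or 'ok' (thresholds are assumptions)."""
    if clipped_fraction(samples) > max_clipped:
        return "too high"
    if peak_dbfs(samples) < low_dbfs:
        return "too low"
    return "ok"
```

Running such a check on the initial test recording gives the interviewer an objective reading in addition to the level indicator on the device.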

Speech Collection: After the collection environment and equipment have been arranged, the interviewer should record and audibly review an initial sample of test speech in the same recording environment using the same equipment as for the collection to confirm that the equipment is working properly, and the audio quality meets the parameters discussed above. This may also expose other sources of noise not originally noted, such as the buzz of fluorescent lights or sounds from air handlers, which can be addressed. Once the setup is verified, the following details should be captured (i) brand of phone (ii) model of phone (iii) OS and (iv) price range. Further if a headset was used then the brand, model and price range of the headset should also be captured. Once recording begins, either the interviewer or the subject must provide a preamble with some subject identifying information along with the date, time and location of the recording session. During the recording, the interviewer should strive to elicit periods of conversational speech from the subject. Conversational speech could be elicited in multiple ways, such as:

Asking the subject to discuss an article from a local newspaper, news website, or social media outlet.

Asking open-ended questions or prompts. A list of possible questions is given in the table below. This list is not exhaustive, and the interviewer should tailor any questions to be appropriate for the circumstances, the subject’s culture, etc.

  • Who are the members living in your family?
  • Describe your favorite place in your city/town/village.
  • Tell us your favorite children’s story.
  • Tell us about your favorite childhood memory.
  • Can you tell us about your favorite dish and how you make it?
  • Can you describe a train?
  • Try to describe your best friend as vividly as possible. What do you like and dislike about him/her?
  • Imagine that you have become extremely rich one day. What would you do with all the money you have?

It can be expected that the longer the subject speaks conversationally (presuming that fatigue does not occur), the greater chance that they will become comfortable with the collection situation, resulting in a more “natural” speech sample. The interviewer should avoid interjections while the subject is speaking (e.g., nodding to acknowledge the subject instead of saying "uh-huh").

Transcribing audio data

We now describe the guidelines for transcribing audio data. These guidelines are inspired by similar guidelines created by a commercial transcription agency and by NIST.

General Principles

Transcribe a word only if you can hear and understand it properly. If the spoken word/text cannot be understood due to the speaker’s manner of speech then mark it as [unintelligible]. On the other hand, if the spoken text cannot be heard due to poor recording, volume or noise then mark it as [inaudible].

It is recommended that the boundary between two utterances in a long audio should be labeled using the tag.

Do not paraphrase the speech.

Do not correct grammatical errors made by the speakers.

Always use the correct spelling for misspoken words. Example: If a speaker pronounces “remuneration” as “renumeration” then it should still be transcribed as “remuneration”.

Capitalize the beginning of every sentence.

Do not expand spoken short forms (e.g., ain’t, don’t, can’t should be retained as they are).

Retain colloquial slang as it is (e.g., gotcha, gonna, wanna, etc).

Non-native words (e.g., English words in Hindi speech) should be transcribed using the script of the word (English, in this case) unless the word is borrowed and nativised in the target language. For example, words like train, computer, internet are borrowed in Hindi from English and can thus be written using the Devanagari whereas English words like “work, cook, etc” which are not borrowed in Hindi but may occur in code-mixed conversations should be written using the English script.

If the spoken words belong to a language that the transcriber does not know then it should be tagged as with the appropriate timestamps.

Verbatim transcription

The speech should be transcribed verbatim. However, the following rules should be used for transcribing errors made by the speakers.

Errors that should be transcribed as it is:

Speech errors: “I was in my office, no sorry, home” should be transcribed as it is.

Slang words: kinda, gonna, wanna, etc should be transcribed as it is.

Repetitions: “I have I have got the book” should be transcribed as it is.

Errors that should not be transcribed:

False starts: “I, um, er, I was going to the mall” should be transcribed as it is.

Filler sounds: um, uh, er, hmm, etc. should be transcribed as it is.

Stutters: “I w-w-went t-t-to the mall” should be transcribed as it is. Following the guidelines [8] for the SWITCHBOARD corpus there should be a hyphen between the stutters as shown in the above sentence.

Non-speech (acoustic) events

Background noise such as “fan whirring”, “dog barking”, “engine running”, “water flowing”, etc. should not be transcribed.

Foreground sounds made by the speaker should be transcribed. These include lip smacks, tongue clicks, inhalation and exhalation between words, yawning, coughing, throat clearing, sneezing, laughing, chuckling, etc. The categories specified in Table 1 in the guidelines for the SWITCHBOARD corpus mentioned earlier should be used.

In the case of transcribing telephone calls, foreground sounds like machine or phone click, telephone ring, noise made by pressing telephone keypad, any other intermittent foreground noise should not be transcribed.

Names, titles, acronyms, punctuations, and numbers

Proper names should be transcribed in a case-sensitive manner in applicable languages. Initials should be in capital letters with no period following. For example: “M K Stalin would be sworn in as the Chief Minister”.

Titles and abbreviations are transcribed as words. For example: Dr. → Doctor except if the abbreviated form is pronounced as it is. For example, if the speaker says “Apple Inc” (instead of “Apple Incorporated”), the word 'Inc' should be transcribed.

Punctuation marks should be used in transcription as appropriate. For example: "don't"

Acronyms should be transcribed as words if spoken as words, and as letters if spoken as letters. When transcribing sequences of letters an underscore is inserted between each letter. For example: NASA; I_B_M

Numbers should be transcribed as they are spoken and normalized to word form (no numerals). For example: 16 → sixteen, 112 → one hundred and twelve.

Times of the day and dates: always capitalize AM and PM. When using o'clock, spell out the numbers: eleven o'clock.
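The number and acronym rules above can be illustrated with a small normalizer. This is only a sketch for English and for integers below one million; the function names are hypothetical, and the hyphen in compounds such as "forty-two" is an assumption, since the guidelines give only the examples 16 and 112. Each Indian language would need its own rules.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_thousand(n):
    """Spell out an integer in the range 1..999."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words.append(ONES[hundreds] + " hundred")
    if rest:
        if hundreds:
            words.append("and")  # "one hundred and twelve", as in the example
        if rest < 20:
            words.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            words.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " ".join(words)


def number_to_words(n):
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest:
        if thousands and rest < 100:
            parts.append("and")
        parts.append(below_thousand(rest))
    return " ".join(parts)


def spell_letters(acronym):
    """Join the letters of an acronym spoken letter by letter with underscores."""
    return "_".join(acronym)
```

For instance, `number_to_words(112)` gives "one hundred and twelve" and `spell_letters("IBM")` gives "I_B_M". A verifier tool could use such functions to flag transcripts that still contain numerals.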

Speaker Label

Mark every utterance with a speaker label.

If the speaker’s name has been mentioned in an earlier utterance, then use this as the speaker label.

If the speaker’s name has not been mentioned earlier then simply use generic labels such as Speaker 1, Speaker 2, …, and so on while ensuring that the same label is consistently used for the same speaker.

Incomplete utterances

These are utterances which are incomplete because the speaker forgot what he wanted to say or was stopped mid-way and corrected an error or was interrupted by someone. Indicate such utterances by putting a ‘--’ at the end of the utterance as opposed to a full-stop or question mark.

ASR Collection Workflow

Collecting Voice Data from Native Speakers

While collecting data from speakers on the ground the following workflow should be followed:

Collecting Speaker information: The collection agency should collect the data about speakers (age, gender, education, location, topics of interest, etc) 2 weeks prior to the data collection process. This data should be analyzed by the project groups (DMU or RnD groups) to ensure that it meets the required specifications of diversity. The collection agency should also request the speaker to sign a consent form (see Appendix A) to give irrevocable license of the collected voice samples to MeitY.

Verifying Speaker information: Before starting recording, the above speaker information should be verified and it should be ensured that the specifications defined earlier w.r.t. speaker diversity are adhered to.

Preparing Data: As mentioned earlier, the intention is to collect a variety of data from each speaker, such as read speech, extempore conversations on topics of interest, role playing conversation. To ensure this the following preparation is required:

Sentences for read speech should be collected and it should be ensured that they meet the data distribution specifications mentioned above (at least 5% data from each of the 12 domains).

Scenarios for using voice commands for digital payments and e-commerce should be designed (e.g., transfer money, check balance, order dairy products, etc).

A get-to-know-me survey covering hobbies, family, movies, music, stories, books, etc should be prepared.

Based on topics of interest entered by the speaker (which come from a predefined list of topics) appropriate conversation scenarios should be created (e.g., “Tell us why you enjoy playing cricket.”)

Scenarios for role-playing conversations should be created (e.g., “Imagine you are a customer conversing with a dairy shop owner to complain about the quality of milk delivered today”)

It should be checked that none of the above content has any offensive/objectionable material.

Recording voice samples: DMU recommends using Karya app for wideband data collection (of course, RnD groups are free to design their own apps/tools). Such a tool should allow the users to read the input (sentence, scenario for a voice command, scenario for role-playing, etc) and record their audio response based on the input. Before submitting, the speaker can listen to the recording and click on submit only if the recording is appropriate, i.e., (i) it is an accurate response to the input (ii) it has no background noise and (iii) it is at an appropriate volume level. All other instructions mentioned in the guidelines above should be followed. In particular, the specifications for equipment, ambience, format, etc are already mentioned in the guidelines. For recording telephony data (e.g., extempore role-playing conversations) a bridge may be set up and the participant(s) can dial into the bridge using their mobile phones to have a telephonic conversation.

Verifying voice samples: DMU recommends using Karya app for verification of collected voice samples (of course, RnD groups are free to design their own apps/tools). Such a tool should allow the users to read the input and listen to the corresponding audio response. The verifier should judge the sample on accuracy, noise and volume on a scale of 0 to 2 (2 being perfect). Only those voice samples which have a rating of 2 on all three parameters should be accepted. Note that it is desired that a small percentage of the data should have some noise as noise is inevitable in real-world scenarios in which ASR models will be deployed. Hence, it is beneficial to train and evaluate ASR models on noisy data. DMU recommends using 20% of noisy data for training (with SNR 10 dB), over and above the full quota of clean data defined in the specifications.

Transcribing voice samples: Since a significant fraction of the data will be collected in extempore mode, the data would have to be transcribed. The workflow for this will be the same as for transcribing existing content as described below.

Transcribing audio data

While transcribing audio data the following workflow should be followed:

Identifying audio data: Some of the audio samples to be transcribed will come from the data collected on the field. These have already been verified and hence no further checks are needed. While selecting audio samples from existing content, it should be ensured that the samples are selected as per the specifications mentioned earlier.

Transcribing audio data: DMU recommends using the Chitralekha tool for transcribing audio (of course, individual RnD groups are free to use any other appropriate tool). The tool should have the following features: (i) allow the user to load an audio file (ii) optionally, generate an automatic transcription for the audio (iii) allow the user to edit the transcription or generate it from scratch. While transcribing the audio, the transcriber should follow all the guidelines defined earlier.

Verifying transcribed data: DMU recommends using the Chitralekha tool for verifying transcriptions (of course, individual RnD groups are free to use any other appropriate tool). The tool should allow the user to load the audio file and the previously generated transcription. While verifying/editing the transcription, the verifier should follow the same guidelines as defined earlier.

ASR Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of certifiers: It is important that the certifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of transcribers. DMU recommends that only transcribers with at least 15 years of experience should be considered for this task.

Verification Tool: DMU will build a certification interface in Shoonya and host it on the Bhashini platform. This interface will allow human verifiers to see an audio-transcription pair and check for its correctness.

Creation of certification task: DMU recommends that after every 50 hours of audio data is collected and transcribed by a group (DMU or RnD projects), the transcriptions along with the audio should be uploaded on the above certification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tool will show the entire data (100%) to the human certifiers.

Certification: The certifiers selected for this task will see a sample of the data and for every audio-transcription pair they will check the overall correctness of the transcription. Any insertions or deletions needed in the transcription will lead it to be deemed incorrect. But with regards to edits, the following will be ignored: minor spelling errors, variants in spellings of named entities, variants in the written forms of some words (such as चलिए vs चलिये).

Quality metric: Once the transcriptions are checked, the fraction of sentences which are shown to contain insertions, deletions, or edits (excluding the above) is computed.

Acceptance criteria: The uploaded batch of data will be accepted only if the word error rate is less than or equal to 10% for training data and less than or equal to 5% for benchmark data.

Rework: In case a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

Apart from certifying the quality of individual batches, at the end of data collection the certifier should certify whether the data meets the specifications mentioned earlier for (i) gender diversity (ii) age diversity (iii) occupation diversity and (iv) genre and content distribution. Note that a completely independent verification of this is not possible and the certifier will just have to rely on the meta-data provided by the data creator. However, the certifier has the right to manually verify the meta-data if needed.
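The sampling and acceptance steps of the certification process can be sketched as follows. This is an illustration under stated assumptions and not the Shoonya implementation: the function names are hypothetical, the error rate is computed at word level as a word edit distance between the certifier's corrected transcript and the submitted transcript, and the exclusions listed under Certification (minor spelling errors, spelling variants) are not modelled.

```python
import random

# Acceptance thresholds from the criteria above: 10% for training, 5% for benchmark.
THRESHOLDS = {"training": 0.10, "benchmark": 0.05}


def sample_for_certification(items, kind, seed=None):
    """Pick the items shown to certifiers: 10% for training data, all for benchmark data."""
    items = list(items)
    if kind == "benchmark" or not items:
        return items
    if kind != "training":
        raise ValueError("kind must be 'training' or 'benchmark'")
    k = max(1, len(items) // 10)
    return random.Random(seed).sample(items, k)


def word_edit_distance(ref, hyp):
    """Minimum number of word insertions, deletions and substitutions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete a reference word
                           cur[j - 1] + 1,           # insert a hypothesis word
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """pairs: iterable of (corrected_transcript, submitted_transcript) strings."""
    pairs = list(pairs)
    edits = sum(word_edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / words if words else 0.0


def accept_batch(pairs, kind):
    """True if the batch meets the acceptance threshold for its kind."""
    return word_error_rate(pairs) <= THRESHOLDS[kind]
```

Passing a different `seed` on resubmission draws a fresh random sample, in line with the rework step.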

ASR Licensing Considerations

It is important that all data collected, either source data (such as source audio for transcription) or derived/collected data (such as audio collected from participants or transcriptions created), have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Permissible licenses of source sentences: While curating sentences for read speech, it should be ensured that such sentences are curated only from sources which come under CC BY 4.0 license (e.g., Wikipedia) or from sources which explicitly grant consent for free usage for all purposes.

Consent of speakers: Explicit signed consent should be taken from speakers and data collection agencies before recording any voice samples to ensure that in the future there are no restrictions in distributing such data to all stakeholders in the language technology ecosystem.

Consent of transcribers and/or transcription agencies: We expect that most transcriptions and verifications will be done by in-house transcribers or external data collection agencies. Explicit consent of in-house transcribers and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data freely for all purposes.

Text To Speech Specifications

To train and evaluate TTS systems we need high quality audio recordings from professional voice artists along with textual scripts/prompts. Such data should have the following characteristics:

High quality recording: The data should be collected in a studio setup.

High quality voice: The data should be collected from a professional voice artist who is different from the voice artists in existing TTS datasets.

Diversity in domains: The spoken content should contain words from a wide variety of domains such as Legal/Govt, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. This would ensure good coverage of vocabulary and phonemes.

Diversity in speakers: There should be at least one male and at least one female speaker for each language.

Diversity in content: The scripts used for recording should contain a mix of short statements, long statements, questions, exclamations and short phrases. We recommend the following distribution: statements (60%), questions (10%), exclamations (10%), short phrases/commands (10%), and stories (10%) to ensure prosodically rich data. It should also be ensured that the scripts used for this data collection have very little overlap with the scripts used for existing TTS datasets for Indian languages.

We recommend the following additional specifications, as per guidelines published by Microsoft.

Property | Value
File format | *.wav, Mono
Sampling rate | 48 kHz
Sample format | 16 bit, PCM
Peak volume levels | -3 dB to -6 dB
SNR | > 35 dB
Silence | There should be some silence (recommended 100 ms) at the beginning and ending, but no longer than 200 ms
Silence between words or phrases | < -30 dB
Silence in the wave after last word is spoken | < -60 dB
Environment noise, echo | The level of noise at start of the wave before speaking < -70 dB
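The silence requirement in the table above (some silence at the start and end, but no longer than 200 ms) could be measured directly on the samples. The sketch below assumes signed 16-bit PCM at 48 kHz; the function names are hypothetical, and treating anything below -60 dB relative to full scale as silence is an assumption borrowed from the "after last word" row, not a rule stated for this purpose.

```python
FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def db_to_amplitude(db):
    """Convert a level in dB relative to full scale to a sample amplitude."""
    return FULL_SCALE * 10 ** (db / 20.0)


def leading_silence_ms(samples, rate, threshold_db=-60.0):
    """Duration in ms before the first sample louder than the threshold."""
    limit = db_to_amplitude(threshold_db)
    for index, sample in enumerate(samples):
        if abs(sample) > limit:
            return 1000.0 * index / rate
    return 1000.0 * len(samples) / rate  # the whole recording is silent


def trailing_silence_ms(samples, rate, threshold_db=-60.0):
    """Duration in ms after the last sample louder than the threshold."""
    return leading_silence_ms(samples[::-1], rate, threshold_db)


def silence_ok(samples, rate=48000, max_ms=200.0):
    """True if there is some silence at both ends, each no longer than max_ms."""
    lead = leading_silence_ms(samples, rate)
    trail = trailing_silence_ms(samples, rate)
    return 0 < lead <= max_ms and 0 < trail <= max_ms
```

A recording with 100 ms of silence at each end passes this check, while one that starts with 300 ms of silence does not.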

TTS Guidelines

We now describe the guidelines for collecting voice data from professional artists for training text-to-speech systems. These guidelines are inspired by similar guidelines published by Microsoft.

Choosing voice artist

The voice artist should be a professional with proven experience in voiceover or voice character work.

The natural voice of the artist should be good (as opposed to an “assumed” voice which would be hard to sustain over a long period of time).

The voice artist must have clear diction and must be able to speak with consistent rate, volume level, pitch, and tone.

The talent also needs to be able to strictly control their pitch variation, emotional affect, and speech mannerisms.

With the help of experienced experts, it should be verified that the voice timbre does not change when the pitch and speed are changed.

Choosing a recording setup

The script should be recorded at a professional recording studio that specializes in voice work and has a recording booth, the right equipment, and the right people to operate it.

The recording should have little or no dynamic range compression (maximum of 4:1).

The audio should have consistent volume and a high signal-to-noise ratio, while being free of unwanted sounds.

Recordings for the same voice style should all sound like they were made on the same day in the same room. This can be achieved through good recording practice and engineering.

Recording requirements

To achieve high-quality training results, adhere to the following requirements during recording or data preparation:

Clear and well pronounced.

Natural speed: not too slow or too fast between audio files.

Appropriate volume, prosody, and break: stable within the same sentence or between sentences, correct break for punctuation.

  • No noise during recording.
  • No wrong accent.
  • No wrong pronunciation.

TTS Collection Workflow

While collecting data for TTS, the following workflow should be followed:

Selecting voice artist: The collection agency should send samples of several voice artists of a language and the best voice artist should be selected based on these samples.

Preparing a script: The script contains the utterances to be spoken by the voice artist. The term "utterances" encompasses both full sentences and shorter phrases. The script should cover different sentence types in your domain including statements (70%), questions (10%), exclamations (10%), short phrases (10%). Each utterance should be between 3 and 40 words.

Recording voice samples: A professional studio setup should be used for recording the voice samples. The guidelines specified earlier should be followed while recording.

Verifying voice samples: The voice data along with the text should be verified by the data collection agency. The verifier should judge the data on accuracy, noise and volume on a scale of 0 to 2 (2 being perfect). Note that accuracy here means that the audio recording should be faithful to the input script provided to the artist. Only those voice samples which have a rating of 2 on all three parameters should be accepted. The average syllable rate of the audio should be measured, and it should be ensured that it is between 6 and 10 syllables per second.

Verifying audio specifications: It should be verified that all voice samples adhere to the specifications described earlier (e.g., *.wav format, 48 kHz sampling rate, etc).
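Some of the workflow checks above are easy to automate. The following sketch covers the utterance length limit, the rating rule and the syllable rate range; the function names are hypothetical, words are counted by splitting on whitespace, and the syllable count is assumed to come from a separate language-specific tool.

```python
def utterance_length_ok(text, min_words=3, max_words=40):
    """True if the utterance has between 3 and 40 whitespace-separated words."""
    return min_words <= len(text.split()) <= max_words


def ratings_ok(accuracy, noise, volume):
    """Accept a sample only if all three ratings (scale 0 to 2) are 2."""
    return accuracy == 2 and noise == 2 and volume == 2


def syllable_rate_ok(num_syllables, duration_seconds, low=6.0, high=10.0):
    """True if the average rate lies between 6 and 10 syllables per second."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return low <= num_syllables / duration_seconds <= high
```

For example, an utterance with 40 syllables spoken in 5 seconds has a rate of 8 syllables per second and passes the rate check.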

TTS Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below. Note that there is no difference between the data collected for ASR and TTS except that in the latter case, the data is collected from a single professional voice artist in a studio setup. Hence, the same quality assurance is applicable here with the additional requirement of checking the more stringent specifications for the audio data.

Selection of certifiers: It is important that the certifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of transcribers. DMU recommends that only transcribers with at least 15 years of experience should be considered for this task.

Verification Tool: DMU will build a certification interface in Shoonya and host it on the Bhashini platform (e.g. https://bhashini.gov.in/shoonya). This interface will allow human verifiers to see an audio-transcription pair and check for its correctness.

Creation of certification task: DMU recommends that after every 5 hours of audio data is collected and transcribed by a group (DMU or RnD projects), the transcriptions along with the audio should be uploaded on the above certification tool. Since this is training data, a 10% sample of the data will be shown to the human certifiers.

Certification: The certifiers selected for this task will check every audio-transcription pair and correct the transcriptions using the same guidelines as defined earlier.

Quality metric: Once the transcriptions are checked/edited, the tool will automatically compute the word error rate, i.e., the percentage of words which were edited.

Acceptance criteria: The uploaded batch of data will be accepted only if the word error rate is less than or equal to 5%.

Rework: In case a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

Apart from certifying the quality of individual batches, at the end of data collection the certifier should certify whether the data meets the specifications mentioned earlier (*.wav format, 24-48 kHz, voice quality, etc).

TTS Licensing Considerations

It is important that all data collected, either source data (such as source sentences used by voice artists to speak out) or derived/collected data (such as audio data collected from voice artists or any annotations done on this data), have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Permissible licenses of source sentences used in the script: While curating sentences for creating the script, it should be ensured that such sentences are curated only from sources which come under permissible licenses such as CC BY 4.0 license [9] (e.g., Wikipedia) or from sources which explicitly grant consent for free usage for all purposes.

Consent of voice artist and collection agency: Explicit signed consent should be taken from the voice artist and collection agency before recording any voice samples to ensure that in the future there are no restrictions in distributing the data freely for all purposes.

Natural Language Understanding

To train and evaluate Natural Language Understanding systems, we need annotated data for each task. The tasks of interest are (1) Named Entity Recognition, (2) Sentiment Analysis, (3) Question Answering, (4) Relation Labeling and so on. In the current document, we only consider the task of NER. In a later version of this document we will consider other NLU tasks. Each task is different and we define the data collection procedure for each task separately.

Named Entity Recognition

Named Entity Recognition is concerned with identifying six common types of entities in text: PERSON, ORGANIZATION, LOCATION, NUMBER (quantity, currency, unit), DATE (years, weekdays, specific dates) and TIME.

NER Specifications

The annotated data should be stored in a structured format following the standard BI (Begin-Inside) annotation scheme. In this scheme, each token is annotated with one of the following labels.

Label Description
B-PER Token which starts a PERSON named entity
I-PER Token which is part of a PERSON named entity, but not the first one
B-ORG Token which starts an ORGANIZATION named entity
I-ORG Token which is part of an ORGANIZATION named entity, but not the first one
B-LOC Token which starts a LOCATION named entity
I-LOC Token which is part of a LOCATION named entity, but not the first one
B-NUM Token which starts a Number named entity
I-NUM Token which is part of a Number named entity, but not the first one
B-DATE Token which starts a Date named entity
I-DATE Token which is part of a Date named entity, but not the first one
B-TIME Token which starts a Time named entity
I-TIME Token which is part of a Time named entity, but not the first one

This means that for every entity type, the start tokens and subsequent tokens would be identified by the entity specific tags. Following is an example of an annotated sentence.

The meeting of ABC Pvt. Ltd. was held at the Ashoka Center by Arun rao

ABC B-ORG
Pvt. I-ORG
Ltd. I-ORG
Ashoka B-LOC
Center I-LOC
Arun B-PER
rao I-PER
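
To make the BI scheme concrete, the sketch below groups a labelled token sequence into entity spans. It is only an illustration under stated assumptions: tokens outside any entity carry the label None (the scheme above defines no tag for them), "Center" is taken to continue the LOCATION entity (I-LOC), and the function name bi_to_spans is hypothetical.

```python
def bi_to_spans(tokens, labels):
    """Group BI-labelled tokens into (entity_type, text) spans.

    Assumption: tokens that are not part of any entity have the label None.
    An I- tag that does not continue an open entity of the same type is ignored.
    """
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        prefix, _, etype = (label or "").partition("-")
        if prefix == "I" and etype == current_type:
            current_tokens.append(token)      # continue the open entity
            continue
        if current_tokens:                    # close the open entity, if any
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "B":
            current_type, current_tokens = etype, [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "The meeting of ABC Pvt. Ltd. was held at the Ashoka Center by Arun rao".split()
labels = ([None] * 3 + ["B-ORG", "I-ORG", "I-ORG"] + [None] * 4
          + ["B-LOC", "I-LOC", None, "B-PER", "I-PER"])
print(bi_to_spans(tokens, labels))
# [('ORG', 'ABC Pvt. Ltd.'), ('LOC', 'Ashoka Center'), ('PER', 'Arun rao')]
```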

Data collected for Named Entity systems should meet the following specifications:

  • Diversity in domains: The sentences should cover a wide variety of domains such as Legal/Govt, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. We recommend at least 5% of the sentences should come from each of these 12 domains.
  • Diversity in length of NEs: For every domain of interest, the corpus should contain NEs of different lengths (e.g., 1 to 5 tokens).
  • Downstream applicability: While collecting training data from a variety of domains would be useful, one should also focus on collecting data for building practical applications, such as everyday usage/conversations.
  • Compliance to formats: The collected data should be in UTF-8 encoding and be ULCA compliant. In addition the data should not contain any objectionable content.
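
Two of these specifications are easy to check automatically. The sketch below assumes each sentence has already been assigned one of the 12 domain names listed above; the helper names are hypothetical, and ULCA compliance and content screening are not covered.

```python
from collections import Counter

NER_DOMAINS = [
    "Legal/Govt", "History", "Geography", "Tourism", "STEM", "Religion",
    "Business", "Sports", "Entertainment", "Health", "Culture", "News",
]


def underrepresented_domains(sentence_domains, min_share=0.05):
    """Return the domains whose share of sentences is below min_share (5%)."""
    counts = Counter(sentence_domains)
    total = len(sentence_domains)
    return sorted(d for d in NER_DOMAINS
                  if total == 0 or counts[d] / total < min_share)


def is_valid_utf8(raw: bytes) -> bool:
    """True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Running the domain check per batch, rather than once at the end of collection, would surface gaps while there is still time to curate more sentences.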

NER Guidelines

Below we describe the guidelines to be used for named entity annotations. These guidelines are partly inspired by similar guidelines prepared for the Message Understanding Conference (MUC-7) and followed by the CoNLL 2003 shared task for Named Entity Recognition, which are widely accepted in the academic community as standard NER benchmarks. The following are the relevant documents.

List of tags with associated categories of names

MUC-7 Named Entity Task Definition (Note: The temporal and numerical expressions in this document are not in the scope of the Bhashini annotation)

https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf

While the task of entity annotation might look straightforward, it is non-trivial to ensure consistency in annotation. As an example, titles such as "Mr." and role names such as "President" are *not* considered part of a person’s name. However, appositives such as "Jr.", "Sr.", and "III" *are* considered part of a person’s name. For consistency, these guidelines must be strictly adhered to.

We recommend that annotators read through the above mentioned guidelines before starting the annotation task. The above guidelines contain a rich set of examples which can be handy when in doubt, so we recommend looking them up for clarifications when required. One challenge which is specific to Indian languages is when Named Entities have agglutinations. In such cases, the NE needs to be segmented as shown below:

Consider the Malayalam word hemayiluNtaayirunna, which means "that which Hema has". The word will have to be segmented and only hemayil will be marked as an NE with the label B-PER.

Similarly, in Madhura Kamaraj universitiyilaanu (which means "it is at Madhura Kamaraj University"), the last word will have to be segmented. The correct labels would be

Madhura B-ORG

Kamaraj I-ORG

Universitiyil I-ORG

Some of the annotations might require topic-specific knowledge, and annotators are encouraged to look up the relevant topics to make decisions when there is any confusion.

Named Entities can be nested and the same should be allowed. For example, in addition to marking “Rajiv Gandhi Institute of Technology” as an ORGANIZATION entity, the nested entity “Rajiv Gandhi” should be marked as PERSON.

NER Collection Workflow

The following workflow is recommended for creating NER annotated data:

Curation of source sentences: The source sentences (i.e., sentences in the source language) should be carefully created to meet the diversity criteria in the specifications mentioned above. It should be ensured that such source sentences are unique, i.e., (i) there are no duplicates across different RnD projects and DMU and (ii) there are no duplicates in existing sources of NER data. To do so, all projects should upload their source sentences in a common repository. It should be ensured that source sentences have at least one named entity. A rough check for the same can be made by running existing Named Entity Aligners if available to reduce wastage of annotation effort.
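
One way to enforce the uniqueness requirement is to compare normalized sentence hashes against the keys already held in the common repository. The sketch below is an assumption about how such a check could work, not a description of the actual repository; sentence_key and filter_new_sentences are hypothetical names.

```python
import hashlib
import unicodedata


def sentence_key(sentence: str) -> str:
    """Stable key for a sentence: NFC-normalize, collapse whitespace, hash."""
    normalized = unicodedata.normalize("NFC", sentence)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_new_sentences(candidates, existing_keys):
    """Keep candidates not already in the repository or earlier in the batch."""
    seen = set(existing_keys)
    fresh = []
    for sentence in candidates:
        key = sentence_key(sentence)
        if key not in seen:
            seen.add(key)
            fresh.append(sentence)
    return fresh
```

NFC normalization matters for Indian scripts, where the same visible text can be encoded with different code point sequences. Exact hashing only catches identical sentences after normalization; near-duplicates would need a separate similarity check.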

Verification of source sentences: Source sentences should be verified to ensure that there are no spelling errors, grammatical errors, factual errors or improper encodings. The source sentences may contain punctuation, quotes and content in brackets, which should be retained as is. During this stage it should also be ensured that the source sentences meet the desired specifications mentioned earlier (e.g., diversity in domains, lengths, etc.).

Named Entity Annotation of source sentences: The annotation should be done using the guidelines described earlier.

Filtering of sentences: Sentences without a named entity should be discarded.

Verification of annotations: Each annotated sentence should be verified by a different annotator (preferably, a senior annotator). The verifier should follow the same guidelines as listed above and wherever needed, edit the annotations to make them perfect.

NER Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of verifiers: It is important that the verifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of annotations. DMU recommends that only annotators with at least 5 years of experience be considered for this task.

Verification Tool: DMU will build a verification interface in Shoonya and host it on the Bhashini platform (e.g. https://bhashini.gov.in/shoonya). This interface will allow human verifiers to see an annotated sentence and check for its correctness.

Creation of verification task: DMU recommends that every batch of 1000 sentences annotated by a group (DMU or RnD projects) should be uploaded on the above verification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tool will show the entire data (100%) to the human certifiers.

Verification: The verifiers selected for this task will see a sample of the data and for every annotated sentence in the sample they will make a binary judgment on whether it is correct or not. While doing so, they will adhere to the same guidelines as defined earlier. In addition, the tool will also allow the verifiers to make edits to the annotations to make them perfect.

Quality metric: Once the sentences are verified the tool will automatically compute the percentage of tags that were edited (insertions, deletions, substitutions).

Acceptance criteria: The uploaded batch of sentences will be accepted only if the percentage of tags that were edited is less than 10%.
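
The sampling rule, quality metric and acceptance criterion above can be sketched as follows. This is an illustration under assumptions: tags are compared position by position over token-aligned sequences, untagged tokens carry None (so an insertion is None to tag and a deletion is tag to None), and the denominator is the number of token positions in the verified sample. All function names are hypothetical.

```python
import random


def sample_for_verification(batch, is_benchmark, fraction=0.10, seed=None):
    """Benchmark batches are shown in full; training batches are sampled."""
    items = list(batch)
    if is_benchmark or not items:
        return items
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, k)


def edited_tag_percentage(original, verified):
    """Percentage of tag positions the verifier changed."""
    total = edited = 0
    for before, after in zip(original, verified):
        if len(before) != len(after):
            raise ValueError("tag sequences must be token-aligned")
        total += len(before)
        edited += sum(a != b for a, b in zip(before, after))
    if total == 0:
        raise ValueError("no tags to compare")
    return 100.0 * edited / total


def ner_batch_accepted(original, verified, threshold=10.0):
    """Accept only if strictly fewer than 10% of the tags were edited."""
    return edited_tag_percentage(original, verified) < threshold
```

Passing a seed makes a sample reproducible for audit; on rework, a different seed (or none) gives the fresh random sample the process calls for.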

Rework: If a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

NER Licensing Considerations

It is important that all data collected, either source data (such as sentences for labeling NLU features) or derived data (such as annotations on the text), have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Permissible licenses of source sentences: While curating sentences for annotation, it should be ensured that such sentences are curated only from sources which come under permissible licenses such as the CC BY 4.0 license (e.g., Wikipedia) or from sources which explicitly grant consent for free usage for all purposes.

Consent of annotators and/or data collection agencies: We expect that most annotations will be done by in-house annotators or external data collection agencies. Explicit consent of in-house annotators and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data to all stakeholders in the language technology ecosystem.

Sentiment Analysis

SA Specifications

To train and evaluate Sentiment Analysis systems for a given language, we need reviews (sentences/paragraphs) with an associated label (positive, negative, neutral). Such data should have the following characteristics.

Diversity in domains: The reviews should be about products from a wide variety of domains. We have identified the following domains from popular e-commerce websites: Baby Products, Education, Electronics, Entertainment, Fashion, Food, Health/Wellness, Hospitality, Pets, Sports/Games, Transportation, Travel, Vehicles, Furniture. We recommend at least 200 reviews from each of these domains.

Diversity in aspects/attributes: For a given product, there should be some diversity in the aspects/attributes that the review talks about (e.g., different reviews for phones should cover different aspects such as screen size, battery life, camera, audio quality, etc). An exhaustive list of attributes for each product should be created and one or more attributes should be randomly sampled while writing the review.

Distribution of labels: The data should contain good representation of negative, positive and neutral reviews. For every domain, we recommend a minimum of 25% of reviews belonging to each of these 3 labels.

Diversity in lengths: The reviews should be of different lengths as counted in words. These could range from very short reviews which have 4-5 words to longer reviews which have 50-60 words. We recommend at least 10% short reviews (4-5 words) and 10% long reviews (> 60 words).

Diversity in writing styles: Reviews are often written in a very informal style. Such reviews should have a good representation in data. We recommend at least 25% of the reviews should be written in informal style.
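
The distribution targets above lend themselves to an automated report. The sketch below assumes each review is a dict with the keys domain, label, style and text (hypothetical field names), and counts words by splitting on whitespace, which is a simplification for some scripts.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")


def label_shortfalls(reviews, min_share=0.25):
    """Return {(domain, label): share} for labels below min_share in a domain."""
    by_domain = defaultdict(Counter)
    for review in reviews:
        by_domain[review["domain"]][review["label"]] += 1
    shortfalls = {}
    for domain, counts in by_domain.items():
        total = sum(counts.values())
        for label in LABELS:
            share = counts[label] / total
            if share < min_share:
                shortfalls[(domain, label)] = share
    return shortfalls


def length_and_style_shares(reviews):
    """Shares of short (4-5 words), long (> 60 words) and informal reviews."""
    n = len(reviews)
    if n == 0:
        raise ValueError("no reviews given")
    lengths = [len(r["text"].split()) for r in reviews]
    return {
        "short": sum(4 <= k <= 5 for k in lengths) / n,
        "long": sum(k > 60 for k in lengths) / n,
        "informal": sum(r["style"] == "informal" for r in reviews) / n,
    }
```

A collection meets the recommendations when label_shortfalls returns an empty dict, the short and long shares are each at least 0.10, and the informal share is at least 0.25.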

SA Guidelines

One approach for creating sentiment labeled data for Indian languages would be to take English data and then translate it to multiple Indian languages. We are avoiding this approach for 2 reasons: (i) such data may have reviews written in a Western context and may not be completely relevant for India (ii) such data may have been crawled from e-commerce websites and hence may have copyright issues. Hence, we recommend creating the data in-house using the following guidelines.

Each review should be written based on a specific prompt. An example of a prompt would be {product = phone, aspects = [audio quality, screen size], model = Samsung Galaxy M33, label = positive, writing style = formal}. The reviews should completely adhere to the specifications given in the prompt.

The following meta-data associated with the review should be stored: product, aspect(s), label (positive/negative/neutral), writing-style (formal/informal)

The reviews should be natural. For example, a glowing positive review for the camera of a low-end phone may not look natural. Annotators should look at English reviews of the product to get a gist of the comments relevant for this product.

The reviews should be written by in-house annotators and should not contain content which is copied verbatim from online sources.

The collection of reviews should adhere to the different diversity criteria mentioned in the specifications above.

SA Collection Workflow

The following workflow will be used for creating data for sentiment analysis.

Curation of products: A list of 1000 products belonging to the different domains mentioned earlier will be curated from different e-commerce websites.

Curation of aspects: For each of these products a list of 3-5 aspects will be created manually (e.g., phone → camera, screen size, battery life, audio, mic).

Curation of brands/models: For each of these products a list of 3-5 relevant brands/models will be created manually by referring to product catalogs on e-commerce websites.

Creation of prompts: Based on the above data, prompts of the form {product, [aspects], brand, label, style} will be created for writing reviews (e.g., {phone, [audio quality, screen size], Samsung Galaxy M33, positive, formal}).
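
Prompt creation from the curated lists can be sketched as below. The catalog structure, the cap of two aspects per prompt and the function name build_prompts are assumptions made for illustration; the guidelines only require that one or more aspects be sampled at random.

```python
import itertools
import random

LABELS = ("positive", "negative", "neutral")
STYLES = ("formal", "informal")


def build_prompts(catalog, rng, max_aspects=2):
    """Create one prompt per (product, brand, label, style) combination.

    catalog maps a product name to {"aspects": [...], "brands": [...]}.
    One or more aspects are sampled at random for each prompt.
    """
    prompts = []
    for product, info in catalog.items():
        for brand, label, style in itertools.product(info["brands"], LABELS, STYLES):
            k = rng.randint(1, min(max_aspects, len(info["aspects"])))
            prompts.append({
                "product": product,
                "aspects": rng.sample(info["aspects"], k),
                "brand": brand,
                "label": label,
                "style": style,
            })
    return prompts


catalog = {
    "phone": {
        "aspects": ["camera", "screen size", "battery life", "audio", "mic"],
        "brands": ["Samsung Galaxy M33"],
    }
}
prompts = build_prompts(catalog, random.Random(0))
```

Enumerating every label and style for each brand yields an even label distribution by construction; skipped prompts will then skew it, so the distribution should still be checked on the reviews actually written.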

Creation of reviews: Based on the prompt, human annotators will write a review by referring to other reviews written online. Human annotators can skip a prompt if they feel that a natural review cannot be written. For example, based on information available online if it is clear that a phone does not have a good camera then it would be unnatural to write a positive review and the annotator could skip such a prompt. Note that the information available online will only be used as a reference and not copied verbatim.

Verification of reviews: Once a review has been created, an independent reviewer will verify that the review is faithful to the given prompt and is natural.

SA Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of certifiers: It is important that the certifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of annotations. DMU recommends that only annotators with at least 5 years of experience in creating/annotating sentences be considered for this task.

Verification Tool: DMU will build a certification interface in Shoonya and host it on the Bhashini platform (e.g. https://bhashini.gov.in/shoonya). This interface will allow human verifiers to see a review and its annotations and check for its correctness.

Creation of certification task: DMU recommends that after every 1000 reviews are collected and annotated by a group (DMU or RnD projects), the reviews along with the annotations/meta-data should be uploaded on the above certification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tool will show the entire data (100%) to the human certifiers.

Certification: The certifiers selected for this task will see a sample of the data and for every review-annotation pair they will check the overall correctness of the annotation. If the review does not match any of the meta-data (i.e, the review is not written in accordance with the prompt) then the review should be marked as incorrect.

Quality metric: Once the annotations are checked, the fraction of reviews which were marked as incorrect is computed.

Acceptance criteria: The uploaded batch of data will be accepted only if the error rate is less than or equal to 5% for both training data as well as for benchmark data.

Rework: If a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

SA Licensing Considerations

It is important that all data collected should have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Consent of annotators and/or data collection agencies: We expect that most annotations will be done by in-house annotators or external data collection agencies. Explicit consent of in-house annotators and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data to all stakeholders in the language technology ecosystem.

Question Answering

In this project, we will focus on span-based QA wherein the goal is to identify the answer to a given question from a given paragraph or to label the question as unanswerable. To train and evaluate QA systems, we need (i) paragraphs, (ii) questions created from these paragraphs and (iii) a label, which is either a span in the paragraph that answers the question or the tag “unanswerable” if the question cannot be answered from the paragraph. An alternative would have been to follow the procedure used for creating the TyDi QA dataset, wherein the annotators are shown a hint about a topic and then create natural questions for that topic. A set of passages relevant to each question is then automatically fetched using a retrieval engine. A different set of annotators is then asked to mark the answer in these retrieved passages if it exists, or else to mark the question as unanswerable. The challenge with this approach is that for many Indian languages there may not be enough articles matching a question available online.

QA Specifications

The data consisting of passages, questions and answers should meet the following specifications:

Diversity in question types: The data should contain questions of different types: Who, What, When, Where, Yes/No. We recommend that at least 10% of the questions belong to each of these categories.

Diversity in domains: The passages from which questions are created should come from a wide variety of domains such as Legal/Governance, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. The exact split across these domains is hard to specify as the availability of paragraphs belonging to each of these domains may vary across languages.

Diversity in answer types: The answers should belong to a wide variety of types such as (i) named entities like person, location, organization, date, number, (ii) common noun phrase, (iii) adjective phrase, (iv) verb phrase, (v) Yes/No and (vi) Unanswerable. It is recommended that at least 5% of the questions belong to each of the first 5 categories and at least 20% of the questions belong to the unanswerable category.

QA Guidelines

We will follow the same guidelines as used for the SQuAD 2.0 dataset.

QA Collection Workflow

The following workflow will be used for creating data for QA.

Curation/Creation of paragraphs: Paragraphs need to be curated from Wikipedia articles belonging to different domains as mentioned in the specifications. For some Indian languages, where enough data is not available on Wikipedia, passages from other pivot languages (say, Hindi, Tamil, Bengali) will be translated to these target languages.

Creation of Questions: The annotators will read the paragraphs and create interesting Questions from them. As mentioned in the guidelines, Questions should be phrased such that they do not have a significant overlap with the vocabulary used in the paragraph. For creating unanswerable questions, the annotators can be shown a few other sentences from the same article and asked to create questions for them. This would ensure that these questions are related to the passage (same topic) but are unanswerable from it.

Annotating answers: An independent set of annotators will verify the questions and mark the spans in the passage if the question is answerable from the passage, else label the question as unanswerable.
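
A span annotation is only useful if the recorded offsets actually point at the answer text. The sketch below checks this for one record, assuming field names borrowed from the SQuAD 2.0 JSON layout (answers, text, answer_start, is_impossible); the ULCA schema may name these differently, and Yes/No answers, which are not spans, are not handled.

```python
def validate_qa(context, qa):
    """Return a list of problems found in one span-based QA record."""
    problems = []
    if qa["is_impossible"]:
        if qa["answers"]:
            problems.append("unanswerable question must not carry answer spans")
        return problems
    if not qa["answers"]:
        problems.append("answerable question has no answer span")
    for answer in qa["answers"]:
        start = answer["answer_start"]
        # Offsets are counted in Unicode code points, as Python strings are.
        if context[start:start + len(answer["text"])] != answer["text"]:
            problems.append(f"span mismatch at offset {start}")
    return problems


def unanswerable_share(qas):
    """Fraction of questions labelled unanswerable (target: at least 0.20)."""
    if not qas:
        raise ValueError("no questions given")
    return sum(bool(q["is_impossible"]) for q in qas) / len(qas)
```

Tools that count offsets in bytes or UTF-16 units will disagree with code point offsets on Indic text, so the unit should be fixed once and stated in the data documentation.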

Verification of QA pairs: An independent set of annotators will check all the QA pairs and ensure that the created question-answer pairs are correct and adhere to the guidelines.

QA Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of certifiers: It is important that the certifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of annotations. DMU recommends that only annotators with at least 5 years of experience in creating/annotating questions and answers be considered for this task.

Verification Tool: DMU will build a certification interface in Shoonya and host it on the Bhashini platform (e.g. https://bhashini.gov.in/shoonya). This interface will allow human verifiers to see a QA pair and its annotations and check for its correctness.

Creation of certification task: DMU recommends that after every 1000 QA pairs are collected and annotated by a group (DMU or RnD projects), the QA pairs along with the annotations/meta-data should be uploaded on the above certification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tool will show the entire data (100%) to the human certifiers.

Certification: The certifiers selected for this task will see a sample of the data and for every question-answer pair they will check the overall correctness of the pair as per the guidelines mentioned earlier. If the QA pair is not created in accordance with the meta-data then that pair should be marked as incorrect.

Acceptance criteria: The uploaded batch of data will be accepted only if the error rate is less than or equal to 5% for both training data as well as for benchmark data.

Rework: If a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.

QA Licensing Considerations

It is important that all data collected should have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Consent of annotators and/or data collection agencies: We expect that most annotations will be done by in-house annotators or external data collection agencies. Explicit consent of in-house annotators and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data to all stakeholders in the language technology ecosystem.

OCR Specifications

We focus on two separate tasks here. The first is to detect the layout of a document and the second is to detect text in images taken in natural settings (e.g., an image of a banner on a shop).

Layout Detection

Here, we will focus on the task of detecting the layout of a document. For this, we need pdf pages where the different parts of the page are clearly highlighted by a bounding box and the order of the boxes is also specified. The pdf pages that we select should meet the following specifications:

Diversity in font sizes: The page should contain text written in a wide variety of font sizes.

  • At least 10% of the images should have some content in font size 6-10
  • At least 10% of the images should have some content in font size 11-16
  • At least 10% of the images should have some content in font size 16-20
  • At least 10% of the images should have some content in font size > 20

Diversity in font types: The page should contain text written in a wide variety of font types. At least 10% of the images should have content written in fonts belonging to each of the following typeface families: serif, sans-serif, script (formal, casual, handwritten, calligraphic), monospaced and display.

Diversity in layouts: The pages should have a variety of layouts:

  • At least 5% in one column format
  • At least 5% in two column format
  • At least 5% in newspaper style
  • At least 5% receipts and invoices
  • At least 5% in magazine formats
  • At least 5% in the format of supreme court judgements
  • At least 5% in the format of prescriptions
  • At least 5% in the format of official government letters (with letterhead)

Diversity in artifacts: The page should contain a variety of artifacts such as tables, figures, indentations, sections, sub-sections and bullet lists. It would be too restrictive to specify the exact percentage of each of these artifacts. Hence, only a general guideline of ensuring diversity in artifacts is specified.

Diversity in background effects: The scanned pages should have a variety of background effects such as crumbling, lighting effects, scan marks, page fold marks, etc. It would be too restrictive to specify the exact percentage of each of these background effects. Hence, only a general guideline of ensuring diversity in background effects is specified.

Compliance to formats: The images should be in JPEG format and the annotations done on them (bounding boxes to identify layouts) should be in ULCA compliant format.

Scene Optical Character Recognition

Here, we will focus on the task of detecting text in images of natural scenes (such as photos taken by a user’s camera). For training and evaluating such a system we need natural images containing text where the area containing the text is highlighted by a rectangular bounding box and the text inside is typed out in UTF-8 encoding. Such data should have the following characteristics.

Diversity in background: The images containing text should have a very diverse background such as buildings, sky, trees, etc.

Diversity in objects: The images containing text should have a very diverse set of objects, such as, shop signs, notice boards, bulletins, banners, flyers, street walls and vehicles.

Diversity in angles: The images should be taken from different angles (left, right, top, bottom, etc.)

Diversity in font types: The images should contain text written in a wide variety of fonts.

Diversity in font sizes: The images should contain text written in a wide variety of sizes.

Diversity in orientation: The images should contain text with different orientations (horizontal, slanting, circular, etc.).

Diversity in ambient light: The images should be taken under different lighting conditions.

Diversity in cameras: The images should be taken by at least 20 different phones, 5 each belonging to the following price ranges: 5K to 10K, 10K to 15K, 15K to 25K, above 25K.

Note that since these are natural images it is hard to specify a very detailed distribution of different properties. For example, for some languages we may not find banners or posters containing very small font sizes or a wide variety of font types. Hence, only a general guideline of ensuring diversity in the collected data is specified. This should be monitored periodically by the Expert Committee and appropriate course correction should be made, if needed.

Printed and Handwriting Optical Character Recognition

The goal is to define two levels of data and annotations: (A) the data required for immediately enabling research and development and demonstrating proof-of-concept applications, and (B) the data eventually required for rolling out robust applications. Here (A) may be thought of as immediate and (B) as eventual.

Printed:

2L words from 100-150 different sources (books, magazines, newspapers, articles, etc.) per language (in total 13 languages, Assamese, Bengali, Hindi, Gujarati, Kannada, Malayalam, Marathi, Manipuri, Odia, Punjabi, Tamil, Telugu, and Urdu).

5L words from 500 different sources per language

Offline Handwritten:

5L Words from 150 writers per language

10L words from 500 writers per language

Application Oriented dataset

The goal is to annotate some special application-focused collections for enabling practical applications in the near future. Approximately 1000 documents in a limited number of languages (one to three) are required for each of five typical use cases/applications (a total of 5000 images):

Forms with hand written Indian language content.

Accounting and Business documents such as Invoices and Receipts

Correspondences such as Letters and Memos

Official Records such as ID cards

Hybrid Reports such as partly Handwritten Prescriptions and Medical Report

Additional Data Resources

In addition to the above image data, developing a high accuracy recognizer will require (i) monolingual text corpus for Language Modeling, and similar tasks and (ii) domain specific Lexicons and Linguistic Resources.

OCR Guidelines

We now describe the guidelines for collecting images for scene text recognition.

General principles

The text in the image should be clearly readable by a human.

The text content should largely be in the target language.

Avoid taking a picture of the same scene in different conditions (zoom in/out, natural/artificial lighting, etc.).

Ensure diversity in content and conditions as mentioned above.

Ensure there is diversity in the cameras with which the images are taken (different megapixels, different brands, etc.).

The images should only contain printed text as opposed to handwritten text.

For typical printed, handwritten and hybrid documents, flatbed scanning (typical scanning) or equivalent modern camera-based scanning (with a dedicated app such as CamScanner) is assumed. For printed or handwritten documents, a good number of documents could be scanned by camera.

Input to typical OCRs could be highly diverse. Therefore, diversity in the data is more important than the quantity of the data. Diversity could be thought of as (i) the number of distinct writers for handwriting, (ii) the number of sources (books, publishers, types of documents) and (iii) illumination, pose and imaging conditions for scene text, etc.

Annotating bounding boxes

For layout detection, a bounding box should be drawn around each segment of text. A segment here would be a continuous unit of text such as a paragraph, a caption, a table, a collection of bullet points, a heading and so on.

It should be ensured that two bounding boxes do not overlap with each other.

For scene text recognition, a bounding box should be drawn around each segment of text. A segment of text here would be a coherent and complete piece of text such as a name, a message, an address, a caption, etc. If a coherent, continuous text is spread across two lines then it is recommended that a separate bounding box is drawn for each line.

Once the bounding box has been marked, the annotators should be allowed to type in the text corresponding to each bounding box.
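
The non-overlap rule for bounding boxes lends itself to an automatic check before annotations are submitted. The following is a minimal sketch in Python, assuming axis-aligned boxes given as pixel coordinates; the `Box` type and the function names are hypothetical and are not part of any Bhashini or Shoonya tool. Boxes that merely touch along an edge are treated as non-overlapping, which is also an assumption.

```python
from itertools import combinations
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the two boxes share a region of positive area.

    Boxes that only touch along an edge or at a corner do not count.
    """
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def find_overlaps(boxes: list[Box]) -> list[tuple[int, int]]:
    """Return index pairs (i, j) with i < j for every pair of overlapping boxes."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(boxes), 2)
            if boxes_overlap(a, b)]


# Illustrative page: the third box intrudes on the second one.
page = [
    Box(10, 10, 200, 60),     # heading
    Box(10, 70, 200, 300),    # paragraph
    Box(150, 250, 400, 350),  # caption drawn too large
]
print(find_overlaps(page))  # [(1, 2)]
```

An annotation tool could run such a check when the annotator saves a page and ask for the flagged pairs to be redrawn.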

OCR Collection Workflow

The following workflow is recommended for creating data for document layout detection:

  • Curation of templates: For document layout detection, it is recommended that we create configurable templates for different layouts (one column, two column, magazine, receipts, etc). These templates can then be populated with text of different font styles and sizes to generate a wide variety of PDF documents. This text could be obtained from public sources which come under the CC BY 4.0 license. This template-based approach is suggested because, for many languages, documents in a wide variety of formats would not be available.
  • Creation of images: Once the PDFs have been generated, they can be converted to images and appropriate noise effects can be added (blurring, crumbling, scanning, etc).
  • Verification of images: The images will then be verified by humans to ensure that they look like real document images. The images which do not meet this criterion will be discarded.
  • Annotation of layouts: The verified images will be shown to annotators to mark the layout as well as the order of different components in the layout (e.g., header followed by text followed by table). The guidelines listed earlier will be followed while making these annotations.
  • Verification of annotations: The annotations will then be verified by an independent verifier using the same guidelines mentioned earlier. The verifier can make edits as needed.

The following workflow is recommended for creating data for scene text recognition:

  • Curation of images: Natural scene images containing text can be curated from the web provided such images come under the CC BY 4.0 license. However, for many languages such images may not be available publicly and hence human photographers will have to be given the task of clicking such natural images. While doing so they should adhere to the guidelines mentioned earlier (e.g., click images of scenes containing text).
  • Verification of images: The images will then be verified by humans to ensure that the text is legible to an average human. All other images will be discarded.
  • Annotation of images: The verified images will be shown to annotators to mark bounding boxes around text as well as enter the text corresponding to each bounding box. The guidelines listed earlier will be followed while making these annotations.
  • Verification of annotations: The bounding boxes and corresponding text will then be verified by an independent verifier using the same guidelines mentioned earlier. The verifier can make edits as needed.

The following workflow is recommended for creating data for Handwriting text recognition:

  • Curation of images: Per language, 100-150 users/writers write handwritten documents based on the corresponding text corpus.
  • Verification of images: The images will then be verified by humans to ensure that the text is legible to an average human. All other images will be discarded.
  • Annotation of images: The verified images will be shown to annotators to mark bounding boxes around text as well as enter the text corresponding to each bounding box. The guidelines listed earlier will be followed while making these annotations.
  • Verification of annotations: The bounding boxes and corresponding text will then be verified by an independent verifier using the same guidelines mentioned earlier. The verifier can make edits as needed.

The following workflow is recommended for creating data for Printed text recognition:

  • Curation of images: Per language, printed document images are collected from 100-150 different sources, like books, magazines, newspapers, articles, etc. Images may be scanned by a flatbed scanner or a camera.
  • Verification of images: The images will then be verified by humans to ensure that the text is legible to an average human. All other images will be discarded.
  • Annotation of images: The verified images will be shown to annotators to mark bounding boxes around text as well as enter the text corresponding to each bounding box. The guidelines listed earlier will be followed while making these annotations.
  • Verification of annotations: The bounding boxes and corresponding text will then be verified by an independent verifier using the same guidelines mentioned earlier. The verifier can make edits as needed.

OCR Quality Assurance

It is important that the data created by the above workflow is independently verified by an external entity. The DMU recommends empaneling a few data collection start-ups who would verify the data and certify it only if it satisfies the acceptance criteria listed below. The DMU further recommends that the entire process be managed by an independent entity (neither DMU nor RnD groups). We will refer to this entity as the certifier. DMU will provide the necessary tools to facilitate this process as described below.

Selection of certifiers: It is important that the certifiers used for certifying the data are of the highest quality. While empaneling vendors for this task, they should be suitably sensitized about the desired quality of annotators. DMU recommends that only annotators with at least 15 years of experience be considered for this task.

Verification Tool: DMU will build a certification interface in Shoonya and host it on the Bhashini platform (e.g. https://bhashini.gov.in/shoonya). This interface will allow human verifiers to see an image and its annotations and check for its correctness.

Creation of certification task: DMU recommends that after every 1000 images are collected and annotated by a group (DMU or RnD projects), the images along with the annotations should be uploaded on the above certification tool. In addition, the group should also specify whether this is training data or benchmark data. For training data, the tool will randomly sample 10% of the data and show it to the human certifiers. For benchmark data, the tool will show the entire data (100%) to the human certifiers.
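
The selection rule for a certification task (a 10% random sample for training data, the entire batch for benchmark data) can be sketched as follows. This is only an illustration: the function name, the `"training"`/`"benchmark"` labels and the rounding up of the sample size are assumptions, not a description of the actual certification tool.

```python
import math
import random


def select_for_certification(image_ids, data_kind, training_percent=10, seed=None):
    """Pick the images of one uploaded batch that human certifiers will review.

    data_kind is "training" or "benchmark" (hypothetical labels).
    """
    image_ids = list(image_ids)
    if data_kind == "benchmark":
        # Benchmark data is shown to the certifiers in full (100%).
        return image_ids
    if data_kind != "training":
        raise ValueError(f"unknown data kind: {data_kind!r}")
    # Assumption: round the sample size up so that small batches are never skipped.
    sample_size = math.ceil(len(image_ids) * training_percent / 100)
    return random.Random(seed).sample(image_ids, sample_size)


# A batch of 1000 annotated images, as recommended per upload.
batch = [f"img_{i:04d}" for i in range(1000)]
training_sample = select_for_certification(batch, "training", seed=0)
benchmark_sample = select_for_certification(batch, "benchmark")
print(len(training_sample), len(benchmark_sample))  # 100 1000
```

Leaving `seed` unset gives a fresh random sample on every call, which matches the requirement that a reworked batch be certified on a new sample.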

Certification: The certifiers selected for this task will see a sample of the data and for every image-annotation pair they will check the overall correctness of the annotation. They will delete, modify or add bounding boxes if the original bounding boxes were incorrect. Similarly, they will edit the text entered for each bounding box if it is found to be incorrect.

Quality metric: Once the annotations are checked, the fraction of bounding box annotations which were added, deleted or modified is computed. Similarly, the fraction of words that were edited is computed.

Acceptance criteria: The uploaded batch of data will be accepted only if both the word error rate and the bounding box error rate are less than or equal to 10% for training data and less than or equal to 5% for benchmark data.

Rework: In case a batch is not accepted, the group (DMU or R&D team) will rework the batch and submit a new application for verification. During this step, a fresh random sample is chosen for certification.
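
As a worked illustration of the quality metric and the acceptance criteria, the sketch below computes both error rates from the certifier's edit counts and applies the 10% and 5% limits. The text does not state the denominators, so the sketch assumes they are the numbers of bounding boxes and words originally submitted in the certified sample; all names are hypothetical.

```python
from dataclasses import dataclass

# Maximum allowed error rate per data kind, taken from the acceptance criteria.
MAX_ERROR_RATE = {"training": 0.10, "benchmark": 0.05}


@dataclass
class CertificationCounts:
    """Edit counts recorded while certifying one sample (hypothetical record)."""
    boxes_submitted: int  # bounding boxes in the sample as uploaded
    boxes_changed: int    # boxes added + deleted + modified by the certifier
    words_submitted: int  # words in the sample as uploaded
    words_edited: int     # words the certifier had to edit


def error_rates(counts: CertificationCounts) -> tuple[float, float]:
    """Return (bounding box error rate, word error rate) as fractions."""
    if counts.boxes_submitted <= 0 or counts.words_submitted <= 0:
        raise ValueError("the sample must contain at least one box and one word")
    return (counts.boxes_changed / counts.boxes_submitted,
            counts.words_edited / counts.words_submitted)


def batch_accepted(counts: CertificationCounts, data_kind: str) -> bool:
    """Accept the batch only if both error rates are within the limit."""
    limit = MAX_ERROR_RATE[data_kind]
    box_rate, word_rate = error_rates(counts)
    return box_rate <= limit and word_rate <= limit


# 40 of 500 boxes changed (8%), 120 of 4000 words edited (3%).
counts = CertificationCounts(boxes_submitted=500, boxes_changed=40,
                             words_submitted=4000, words_edited=120)
print(batch_accepted(counts, "training"))   # True: 8% and 3% are within 10%
print(batch_accepted(counts, "benchmark"))  # False: 8% exceeds 5%
```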

Apart from certifying the quality of individual batches, at the end of data collection the certifier should certify whether the data meets the specifications mentioned earlier using a 10% random sample (at least 5% for each font size, different layouts, etc.).
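
One possible reading of this final check is a stratified sample: draw at least 5% from every stratum (for example, each font size or layout) and then top up at random until 10% of the whole collection is covered. The sketch below follows that reading; the interpretation, the rounding up of quotas, the function name and the example strata are all assumptions.

```python
import math
import random
from collections import defaultdict


def final_certification_sample(items, overall_percent=10, per_stratum_percent=5, seed=None):
    """Select item ids for the end-of-collection check.

    items is a list of (item_id, stratum) pairs with unique item ids, where
    stratum is a label such as a layout name or font size.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item_id, stratum in items:
        by_stratum[stratum].append(item_id)

    # Step 1: guarantee the minimum share from every stratum.
    selected = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        quota = math.ceil(len(members) * per_stratum_percent / 100)
        selected.extend(rng.sample(members, quota))

    # Step 2: top up at random from the rest to reach the overall target.
    target = math.ceil(len(items) * overall_percent / 100)
    chosen = set(selected)
    remaining = [item_id for item_id, _ in items if item_id not in chosen]
    top_up = max(0, target - len(selected))
    selected.extend(rng.sample(remaining, min(top_up, len(remaining))))
    return selected


# Illustrative collection: 900 one-column and 100 two-column documents.
items = [(f"doc_{i}", "one_column" if i < 900 else "two_column") for i in range(1000)]
sample = final_certification_sample(items, seed=0)
print(len(sample))  # 100
```

Without the per-stratum minimum, a rare layout could be missed entirely by a plain 10% random sample, which is the situation this scheme guards against.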

OCR Licensing Considerations

It is important that all data collected, either source data (such as source images) or derived data (such as annotations on images), have permissible licenses to enable the widest possible use-cases of the created language resources. To enable this, the following licensing conditions are suggested for these types of data.

Permissible licenses of source images: While curating images for annotation, it should be ensured that such images are curated only from sources which come under permissible licenses such as the CC BY 4.0 license (e.g., Wikipedia).

Consent of photographers, annotators and/or data collection agencies: We expect that many photos may be collected by photographers hired by data collection agencies. Similarly, we expect the annotations will be done by in-house annotators or external data collection agencies. Explicit consent of photographers, in-house annotators and data collection agencies should be taken to ensure that in the future there are no restrictions in distributing the data to all stakeholders in the language technology ecosystem.