
Implicature Benchmark for Hindi

Kaveri Anuranjana

Research Scholar
IIIT-Hyderabad


Authors: Kaveri Anuranjana, Amit Shukla, *Srihitha Mallepally, *Mareddy Sri Harshitha, Prof. Radhika Mamidi

Abstract

According to the cooperative principle (Grice 1975), utterances in a conversation adhere to four conversational maxims: quantity, quality, relation and manner. An implicature arises when a maxim is flouted or violated, and the hearer must infer the intended meaning. Consider the example:

Speaker A: क्या दिलीप को परीक्षाएं आसान लग रही हैं?
[Kya Dilip ko parikshayein aasan lag rahi hain?]
(Does Dilip find the exams easy?)

Speaker B: आज-कल वो सो नहीं पा रहा।
[Aaj-kal wo so nahi pa raha.]
(He hasn’t been able to sleep lately.)

Speaker B flouts the maxim of relation (relevance) by not answering the question with a yes or no. Instead, the reply that Dilip has trouble sleeping implies that he does not find the exams easy.

While computational approaches have made some progress on English implicature benchmarks such as GRICE (Zheng, 2021), IMPPRES (Jeretic, 2020) and the Conversational Implicature Benchmark (CIB) (George and Mamidi, 2020), no such benchmark exists for Hindi. Hence, we propose a Hindi implicature benchmark.

Indirect Questions Hi Implicature Benchmark

Toledo-Ronen (2020) found that translation-based methods lose nuances in argument evaluation. The XCOPA benchmark (Ponti, 2020) and Huang (2024) demonstrated that translated data hurts model performance for structurally divergent non-English languages. Hindi is a free-word-order, morphologically rich language; since it is structurally different from English, translating existing English implicature benchmarks may not be suitable.

Wang (2024) present a benchmark of indirect answers to yes-no questions in dialogues. Following a similar approach, we plan to collect Hindi interviews and extract indirect questions along with answers that involve implicature, annotating 5,000 such questions. As a constraint, the preceding dialogue turns will be attached to each question as background, so that the inference is made over a single dialogue turn. Evaluation will rely on ROUGE scores to measure correctness; we leave the exploration of more suitable metrics for future work. For model evaluation, we plan to leverage (a) English LLMs: meta-llama/Llama-3.3-70B (Grattafiori, 2024) and the current state of the art reported by Sravanthi (2024), FlanT5-XXL (Chung, 2022) finetuned on Hindi-translated GRICE (Zheng, 2021), IMPPRES (Jeretic, 2020) and CIB (George and Mamidi, 2020); and (b) Hindi LLMs: Llama-3-Nanda-10B-Chat (Choudhury, 2024) and Airavata (Gala, 2024).
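As a rough illustration of the planned evaluation loop, the Python sketch below shows how one benchmark item (background dialogue, question, indirect answer, gold implicature) might be turned into a prompt, sent to a model, and scored with ROUGE. The field names, the example gold implicature, and the generate() wrapper are placeholders of our own, not a finalized specification; the sketch assumes the rouge_score Python package.

# Hypothetical sketch of the planned evaluation, assuming the rouge_score package.
# Each item carries the preceding dialogue as background plus one question-answer
# turn and a gold implicature; model outputs are scored with ROUGE.
from rouge_score import rouge_scorer

examples = [
    {
        "background": "",  # preceding dialogue turns, if any
        "question": "Kya Dilip ko parikshayein aasan lag rahi hain?",
        "answer": "Aaj-kal wo so nahi pa raha.",
        # Placeholder gold implicature for illustration only.
        "gold_implicature": "Dilip ko parikshayein aasan nahi lag rahi hain.",
    },
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)

def build_prompt(ex):
    # Previous dialogue is prepended as background so the model only has to
    # infer the implicature of a single question-answer turn.
    return (
        f"{ex['background']}\n"
        f"Question: {ex['question']}\n"
        f"Answer: {ex['answer']}\n"
        "What does the answer imply?"
    )

def evaluate(generate, examples):
    # `generate` is any callable wrapping an LLM: prompt -> predicted implicature.
    scores = []
    for ex in examples:
        prediction = generate(build_prompt(ex))
        result = scorer.score(ex["gold_implicature"], prediction)
        scores.append(result["rougeL"].fmeasure)
    return sum(scores) / len(scores)

The same evaluate() call could be reused across the English and Hindi LLMs listed above by swapping in a different generate() wrapper for each model.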