PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models
- Ishaan Watts
- Varun Gumma
- Aditya Yadavalli
- Vivek Seshadri
- Swami Manohar
- Sunayana Sitaram
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of LLM pre-training data with popular benchmarks, and the absence of local, cultural nuances in translated benchmarks. As a result, it is difficult to evaluate LLMs extensively in the multilingual setting, leading to a lack of fair comparisons between models and difficulties in replicating the evaluation setup used by some models. Recently, several Indic (Indian language) LLMs have been created in answer to a call to build more locally and culturally relevant LLMs. Our evaluation framework, named Pariksha, is the first comprehensive evaluation of Indic LLMs that uses a combination of human and LLM-based evaluation. We conduct a total of 90k human evaluations and 50k LLM-based evaluations of 29 models to present leaderboards for 10 Indic languages. Pariksha not only provides inclusive and democratic evaluation by engaging a community of workers who represent the average Indian, but also serves as a research platform for improving the process of evaluation. By releasing all evaluation artifacts, we make the evaluation process completely transparent. By conducting Pariksha at regular intervals, we aim to provide the Indic LLM community with a dynamic, evolving evaluation platform, enabling models to improve over time using insights and artifacts from our evaluations.
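The abstract does not specify how the human and LLM judgments are aggregated into leaderboards. A common approach for pairwise model comparisons is an Elo-style rating; the sketch below illustrates that idea under this assumption, and all names in it (`battles`, `K`, `BASE`, the example model IDs) are hypothetical rather than part of Pariksha's released tooling.

```python
# Minimal sketch: aggregating pairwise model comparisons into an Elo-style
# leaderboard. The rating scheme and all identifiers here are illustrative
# assumptions; the abstract does not state Pariksha's aggregation method.
from collections import defaultdict

K = 32        # update step size per comparison
BASE = 1000   # initial rating assigned to every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_leaderboard(battles):
    """battles: iterable of (model_a, model_b, winner), where winner is
    'a', 'b', or 'tie'. Returns (model, rating) pairs, highest rating first."""
    ratings = defaultdict(lambda: float(BASE))
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
        ratings[model_a] += K * (s_a - e_a)
        ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Example: three hypothetical judgments between two models on one prompt set.
if __name__ == "__main__":
    votes = [("model_x", "model_y", "a"),
             ("model_x", "model_y", "tie"),
             ("model_y", "model_x", "a")]
    for model, rating in elo_leaderboard(votes):
        print(f"{model}: {rating:.1f}")
```

In such a setup, human and LLM judgments could be aggregated into separate leaderboards or pooled, depending on how much agreement is observed between the two sources.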