Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

North American Chapter of the Association for Computational Linguistics

Published at NAACL 2024

Recently, there has been a surge in LLM evaluation research aimed at understanding LLM capabilities and limitations. However, much of this research has been confined to English, leaving the building and evaluation of LLMs for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, and Llama2) by comparing them on the same set of multilingual datasets. Our benchmark comprises datasets covering a diverse range of languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Our experiments show that GPT-4 and PaLM2 outperform the Llama2 and Mistral models on a variety of tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
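To make the evaluation setup concrete, the sketch below outlines the kind of comparison the abstract describes: every model is scored on the same set of multilingual datasets so that results are directly comparable across models and languages. This is a minimal illustrative outline, not the paper's actual harness; all names here (MODELS, DATASETS, query_model, score) are hypothetical placeholders, and a real run would call each model's API and use task-specific metrics.

```python
# Minimal sketch of a multilingual LLM benchmark loop (hypothetical names;
# not the paper's evaluation code).
from collections import defaultdict

MODELS = ["gpt-4", "gpt-3.5-turbo", "palm2", "llama2-70b", "mistral-7b"]

# dataset name -> (language, list of (prompt, reference) pairs)
DATASETS = {
    "nli-sw": ("Swahili", [("Premise ... Hypothesis ...", "entailment")]),
    "qa-hi": ("Hindi", [("Context ... Question ...", "answer")]),
}

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an API call to `model`; returns a dummy prediction."""
    return "entailment"

def score(prediction: str, reference: str) -> float:
    """Exact match for illustration; real tasks use task-specific metrics."""
    return float(prediction.strip().lower() == reference.strip().lower())

results = defaultdict(dict)
for model in MODELS:
    for name, (lang, examples) in DATASETS.items():
        total = sum(score(query_model(model, p), ref) for p, ref in examples)
        results[model][name] = total / len(examples)
        print(f"{model:>14} | {name} ({lang}): {results[model][name]:.2f}")
```

Holding the datasets and prompts fixed across models, as above, is what allows per-language comparisons such as "GPT-4 outperforms PaLM2 on more datasets" to be made on equal footing.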