Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

North American Chapter of the Association for Computational Linguistics

Published at NAACL 2024

Recently, there has been a surge in LLM evaluation research aimed at understanding LLM capabilities and limitations. However, much of this research has been confined to English, leaving the building and evaluation of LLMs for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, and Llama2) by comparing them on the same set of multilingual datasets. Our benchmark comprises datasets covering a diverse range of languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Our experiments show that GPT-4 and PaLM2 outperform the Llama2 and Mistral models on a variety of tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
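To make the evaluation setup concrete, the sketch below outlines the kind of comparison the abstract describes: every model is scored on the same set of multilingual datasets so that results are directly comparable across models and languages. This is a minimal illustrative outline, not the paper's actual harness; all names here (MODELS, DATASETS, query_model, score) are hypothetical placeholders, and a real run would call each model's API and use task-specific metrics.

```python
# Minimal sketch of a multilingual LLM benchmark loop (hypothetical names;
# not the paper's evaluation code).
from collections import defaultdict

MODELS = ["gpt-4", "gpt-3.5-turbo", "palm2", "llama2-70b", "mistral-7b"]

# dataset name -> (language, list of (prompt, reference) pairs)
DATASETS = {
    "nli-sw": ("Swahili", [("Premise ... Hypothesis ...", "entailment")]),
    "qa-hi": ("Hindi", [("Context ... Question ...", "answer")]),
}

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an API call to `model`; returns a dummy prediction."""
    return "entailment"

def score(prediction: str, reference: str) -> float:
    """Exact match for illustration; real tasks use task-specific metrics."""
    return float(prediction.strip().lower() == reference.strip().lower())

results = defaultdict(dict)
for model in MODELS:
    for name, (lang, examples) in DATASETS.items():
        total = sum(score(query_model(model, p), ref) for p, ref in examples)
        results[model][name] = total / len(examples)
        print(f"{model:>14} | {name} ({lang}): {results[model][name]:.2f}")
```

Holding the datasets and prompts fixed across models, as above, is what allows per-language comparisons such as "GPT-4 outperforms PaLM2 on more datasets" to be made on equal footing.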