Knowledge Distillation

Modern machine learning applications have enjoyed a great boost from deep, large-scale neural network models, which achieve state-of-the-art results on a wide range of tasks such as question answering, conversational AI, search, and recommendation. A significant challenge facing practitioners is how to deploy these huge models in practice. Recent pre-trained language models like Turing-NLG and GPT-3 boast a massive 17 billion and 175 billion parameters, respectively. Although they obtain superior performance on several tasks, they are too slow and expensive to use in practice. In this project, we develop techniques to compress these huge Multilingual pre-TRainEd ModEls (XTREME) into shallower, simpler ones that are easier and faster to use while retaining the performance of the original models.
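
As a concrete illustration of the distillation idea behind such compression, the sketch below shows the classic knowledge-distillation objective in PyTorch: a small student model is trained to match the temperature-softened output distribution of the large teacher, blended with the usual cross-entropy against ground-truth labels. This is a minimal, generic sketch, not the project's actual training code; the function name, hyperparameter defaults (`temperature`, `alpha`), and tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the soft-target loss (teacher -> student) with hard-label CE.

    `temperature` softens both distributions; `alpha` weights the two terms.
    Both defaults are illustrative, not values taken from this project.
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors standing in for real model outputs.
batch, num_classes = 8, 10
student_out = torch.randn(batch, num_classes, requires_grad=True)
teacher_out = torch.randn(batch, num_classes)  # teacher logits (frozen)
targets = torch.randint(0, num_classes, (batch,))
loss = distillation_loss(student_out, teacher_out, targets)
loss.backward()
```

Raising the temperature exposes more of the teacher's "dark knowledge" (the relative probabilities it assigns to wrong classes), which is what lets a shallow student recover much of the large model's behavior.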