EMT: End To End Model Training for MSR Machine Translation

Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning


Machine translation, at its core, is a Machine Learning (ML) problem that involves learning language translation from large amounts of parallel data, i.e., translations of the same text in two or more languages. If we have parallel data between languages L1 and L2, we can build translation systems between these two languages. When training a complete system, we train several different models, each containing a different type of information about either one of the languages or the relationship between the two. We end up training thousands of models to support hundreds of languages. In this article, we explain our end-to-end architecture for automatically training and deploying models at scale. The goal of this project is to create a fully automated system responsible for gathering new data, training systems, and shipping them to production with little or no guidance from an administrator. By using the ever-changing and always-expanding contents of the web, we have a system that can quietly improve our existing systems over time. We detail the architecture, discuss the various problems we encountered, and describe the solutions we arrived at. Finally, we present experiments and data showing the impact of our work. Specifically, this system has enabled us to ship much more frequently and to eliminate the human errors that occur when running repetitive tasks. The principles of this pipeline can be applied to any ML training and deployment system.
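As a rough illustration of the kind of automation described above (gather new parallel data, retrain, evaluate, and ship only if quality improves), the sketch below shows one possible orchestration loop. All names, stages, and scores here are hypothetical placeholders, not the actual EMT components.

```python
"""Hypothetical sketch of an automated train-and-deploy loop for one
language pair. Functions and metrics are illustrative only."""

from dataclasses import dataclass


@dataclass
class TranslationSystem:
    language_pair: tuple[str, str]
    bleu: float  # quality score from an automatic evaluation


def gather_parallel_data(src: str, tgt: str) -> list[tuple[str, str]]:
    # Placeholder: a real pipeline would pull newly crawled parallel
    # sentences for the (src, tgt) pair from storage.
    return [("hello", "bonjour")]


def train_system(src: str, tgt: str,
                 corpus: list[tuple[str, str]]) -> TranslationSystem:
    # Placeholder: training the several models that make up one system.
    return TranslationSystem((src, tgt), bleu=31.2)


def production_system(src: str, tgt: str) -> TranslationSystem:
    # Placeholder: the system currently serving traffic for this pair.
    return TranslationSystem((src, tgt), bleu=30.5)


def run_pipeline(src: str, tgt: str) -> None:
    corpus = gather_parallel_data(src, tgt)
    candidate = train_system(src, tgt, corpus)
    # Ship only when the candidate beats the deployed system, so the
    # loop can run unattended without regressing translation quality.
    if candidate.bleu > production_system(src, tgt).bleu:
        print(f"deploying new {src}-{tgt} system (BLEU {candidate.bleu})")


if __name__ == "__main__":
    run_pipeline("en", "fr")
```

In practice such a loop would run per language pair on a schedule, which is how a pipeline like this can retrain and ship thousands of models with little operator involvement.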