Getting Modular with Language Models: Building and Reusing a Library of Experts for Task Generalization
Presented by Alessandro Sordoni at Microsoft Research Forum, March 2024
“We have witnessed basically the wide adoption of large language models, such as GPT-4, that have very broad capabilities and can be used to solve a variety of tasks. But they are, kind of, expensive to serve, and somehow, we can actually think and ask ourselves, are they really necessary for most tasks that users … might need?”
– Alessandro Sordoni, Principal Researcher, Microsoft Research Montreal
Transcript: Lightning Talk 3
Getting modular with language models: Building and reusing a library of experts for task generalization
Alessandro Sordoni, Principal Researcher, Microsoft Research Montréal
Alessandro Sordoni shares recent efforts on building and reusing large collections of expert language models to improve zero-shot and few-shot generalization to unseen tasks.
Microsoft Research Forum, March 5, 2024
Hi, everyone. My name is Alessandro. I’m from Montréal. I’m going to share with you a vision for building, sort of, modular language models.
So we have witnessed basically the wide adoption of large language models, such as GPT-4, that have very broad capabilities and can be used to solve a variety of tasks. But they are, kind of, expensive to serve, and somehow, we can actually think and ask ourselves, are they really necessary for most tasks that users of Microsoft, for example, might need?
This has basically boosted the development of small language models—for example, the mighty and very powerful Phi-2—that can be adapted to user tasks, right. Either with full fine-tuning, which means we change all the parameters of the model, or with parameter-efficient adaptation, for example, by training LoRAs, which only change a small amount of the parameters of the model. And basically, we can see these model adapters as experts at their own tasks, and this is great because we now have a cost-effective model that can solve the task very effectively. But the problem now is that we have only very narrow capabilities. So the question that we ask here is, now that we have all these expert models for each user, for each task, can we actually reuse the expert models either for building small models that have broader capabilities or for adapting to new users and tasks more efficiently?
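To make the setup concrete, here is a minimal, illustrative sketch of a LoRA-adapted linear layer in PyTorch. The class name, rank, and scaling are hypothetical choices, not the configuration used in this work; the point is only that the base weights stay frozen and each expert is just a small pair of low-rank factors trained on one task.

```python
# Minimal LoRA sketch (illustrative; hyperparameters are assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                # only A and B are trained per task

    def forward(self, x):
        # y = W x + scale * B (A x): a low-rank additive shift on the hidden states
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```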
Let me show you how this system could work. So we start from a base model, which is Phi-2, and we adapt this base model for every user or for a set of tasks, and we group these, sort of, adapters into a library. And now we come up with a, sort of, orchestration mechanism that chooses which adapters to use based on a new user query to produce a system response. The system has some desirable properties. One is that it enhances the base language model capabilities via such an expert composition, and this resembles, a little bit, how mixture of experts works, right. But there is a big difference here: these experts are not trained with the base model itself, but they are trained a posteriori. This leads us to the second point, which is basically a, sort of, decentralized training of these LoRA experts, and this is good because, for example, it preserves privacy in the sense that we do not need—we do not require—all the data to be shared at once and to always retrain the base model from scratch. And second, energy efficiency: these LoRA experts are very energy efficient, and so basically, we can train them very quickly. The third point is interpretability, because these experts are usually associated with the task that they can solve. And so upon seeing a new user query, we can actually inspect which expert has been activated for that user query, and we can get a little bit of a sense of which capabilities are actually required.
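As a rough sketch of the orchestration step, one can think of each adapted layer as mixing the base output with the low-rank shifts of the few experts a router selects for each hidden state. The function below is a hypothetical illustration, not the exact mechanism described in the talk; the router is left abstract here and a possible routing recipe is sketched further below.

```python
# Hedged sketch: mix the top-k experts' LoRA shifts into a base layer's output.
import torch

def compose_experts(base_layer, library, router, x, top_k=2):
    """library: dict name -> (A, B) LoRA factors; router: fn(x, names) -> scores."""
    names = list(library.keys())
    scores = router(x, names)                               # (..., num_experts)
    # keep only the top-k experts per token and renormalize their weights
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    weights = torch.zeros_like(scores).scatter(
        -1, top_idx, torch.softmax(top_scores, dim=-1))
    out = base_layer(x)
    for e, name in enumerate(names):
        A, B = library[name]
        out = out + weights[..., e:e + 1] * ((x @ A.T) @ B.T)  # expert's low-rank shift
    return out
```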
So in order to build this system, we have to answer two questions. One, how do we build such an expert library? And second, how do we select the relevant experts for new inputs? So the first scenario that we are dealing with is a, sort of, private scenario in the sense that we have an ensemble of datasets, which are tasks or user data. And in the private scenario, we assume that we cannot share data across these tasks, OK. We cannot train on all the data together. And so basically, one standard approach is to fine-tune, for example, a LoRA adapter on each dataset independently. Here in the figure, we are going to end up with a library with three experts. But let’s say that we can actually share a certain amount of this data, for example, if we are dealing with public tasks or stuff like that. So the idea here around this approach is to basically form a, sort of, clustering of these tasks and just train an adapter for each cluster. How do we cluster these tasks actually? We basically do a, sort of, private LoRA fine-tuning for a few steps at the beginning, to get just a weight for each task, right. And then we cluster the weights of each task by their similarity, and we group tasks together that have high weight similarity. And we train one adapter per cluster. So at the end, we are going to end up with a library of two experts. This basically relies on the intuition that the similarity in the weight space for these tasks reflects how synergistic these tasks are when adapters are trained on the joint dataset.
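A rough sketch of that clustering idea follows, under the assumption that the per-task LoRA weights from the short private fine-tune are flattened into vectors and grouped with an off-the-shelf algorithm; k-means is a stand-in here, and the actual procedure and number of clusters may differ.

```python
# Illustrative clustering of tasks by LoRA weight similarity (assumptions noted above).
import torch
from sklearn.cluster import KMeans

def cluster_tasks(task_lora_weights, num_clusters=2):
    """task_lora_weights: dict task_name -> list of LoRA tensors from a short fine-tune."""
    names = list(task_lora_weights.keys())
    # one flat, L2-normalized vector per task, so similarity reflects direction, not scale
    vecs = torch.stack([
        torch.nn.functional.normalize(
            torch.cat([w.detach().flatten() for w in task_lora_weights[n]]), dim=0)
        for n in names
    ])
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(vecs.numpy())
    clusters = {}
    for name, label in zip(names, labels):
        clusters.setdefault(int(label), []).append(name)
    return clusters  # then train one LoRA expert per cluster on its tasks' joint data
```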
Now that we have our great library of experts, we have to choose how to actually select them upon seeing a new input, OK. Here we assume that we do not have access to the data that’s been used to train the experts. So basically, you trained your experts, you gave them to us, and we figure out how to use them. So here, to do routing, we basically select which expert to use for each representation in the base model. The base model is a transformer; we have a representation for each layer and for each token. And so basically, we route by dot products between each hidden state in the transformer and an expert representation. But now we have to come up with an expert representation, and we do not know which data these experts have been trained on. Here, to do so, we leverage the functional form of these LoRA adapters, which basically produce a linear shift on the hidden states’ representations. And so we take the linear transform of the LoRA adapter, we decompose it into singular directions, and we take the top singular direction of that matrix. That gives us our expert representation. We stack those expert representations into a routing matrix, we compute the dot products, much like in the mixture-of-experts parameterization, and we choose which experts to use based on the scores obtained that way.
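Here is a hedged sketch of that routing recipe: the LoRA update B·A is a low-rank linear shift, its top right singular vector serves as the expert's prototype, and hidden states are scored against the stacked prototypes. Details such as taking the absolute value are assumptions of this sketch; the resulting scores could play the role of the router in the composition sketch above (e.g., via a small wrapper holding the stacked prototypes).

```python
# Illustrative prototype-based routing from LoRA weights alone (no training data needed).
import torch

def expert_prototype(A, B):
    """A: (rank, d_in), B: (d_out, rank). Top right singular direction of B @ A."""
    delta_w = B @ A                                   # (d_out, d_in) low-rank shift
    _, _, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    return Vh[0]                                      # unit-norm direction in input space

def arrow_scores(hidden, prototypes):
    """hidden: (..., d_in); prototypes: (num_experts, d_in) stacked routing matrix."""
    return (hidden @ prototypes.T).abs()              # higher score -> more relevant expert
```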
Here, the idea is simple: that singular direction gives us a sense of how the hidden states for that expert looked when that expert was training. So in order to test our system here, we assume that we have some data for tasks available, and we use, like, FLAN data, which is just natural language tasks. We evaluate our system on a set of 10 tasks used to evaluate Phi-2, and these tasks range from commonsense reasoning to code, BBH—BIG-Bench Hard—et cetera. And so these are some results that we obtained in a recent submission. We have Phi-2, the first part, which gets around 64 out of the box, and then we actually fine-tuned Phi-2 on our own multitask dataset, and this gets a boost to around 65.5. And basically, this approach assumes that we can train on all data, right. And then we have our first dot, which is “Private + Arrow.” Private, as a reminder, trains experts independently—256 tasks—and then there is post-hoc routing. And here it was very surprising to us that we can actually get some good performance even with this method.
But if we go further and assume some sort of selective data sharing, so that we have our clustering approach and then the routing on top of that, we can get even further gains. And this last method—“MBC + Arrow”—actually adds only 22 million parameters to the model.
So looking forward, I believe that an exciting direction would be really to push this to fully decentralized training and continual improvement of language models, in the sense that people can train their experts, they give them to the platform, and the model gets better. The other point is a heterogeneous library of adapters, in the sense that we can actually add different sorts of adapters into this library, each with its own inductive biases, and so we can expand the capabilities even more.
Thank you very much.