The vision of the researchers at the Microsoft Research Montreal lab is to create machines that can comprehend, reason, and communicate with humans. As part of this vision, our dialogue team has been doing research on task-oriented dialogue systems. We earlier proposed the lexicalized delexicalized – semantically controlled – LSTM (ld-sc-LSTM) model for Natural Language Generation (NLG), which outperformed state-of-the-art delexicalized approaches.
In this new work, we perform an empirical study to explore the relevance of unsupervised metrics for the evaluation of goal-oriented NLG. The NLG task for goal-oriented dialogue systems consists of translating dialogue acts into text. A dialogue act encodes the intent of the speaker; it can be seen as a compact representation of a sentence.
An example is:
Dialogue Acts: inform(food=Indian, count=2, area=downtown, city=Toronto), request(budget)
Text: There are 2 restaurants serving Indian food in downtown Toronto. What is your budget?
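As a rough illustration of what the NLG model consumes and produces, the example above could be represented programmatically as follows. This is a minimal sketch; the field names (act, slots) are illustrative and not taken from the paper or the released code.

```python
# Illustrative representation of the dialogue-act -> text example above.
# The structure and field names ("act", "slots") are assumptions for this sketch.
dialogue_acts = [
    {"act": "inform",
     "slots": {"food": "Indian", "count": "2",
               "area": "downtown", "city": "Toronto"}},
    {"act": "request", "slots": ["budget"]},
]

# One valid surface realization of these dialogue acts (the dataset reference):
reference_text = ("There are 2 restaurants serving Indian food in downtown "
                  "Toronto. What is your budget?")
```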
We also introduce a slight variant of the ld-sc-LSTM called the hierarchical lexicalized delexicalized – semantically controlled – LSTM (hld-sc-LSTM) model. This model is further described in our paper.
The problem of evaluation
A dialogue can progress in many valid directions, and a sentence can be expressed in many ways. For instance, a valid sentence for the previous example could also be ‘I found 2 Indian restaurants in downtown Toronto. How much would you like to spend?’. However, a sentence generated by an NLG system is typically evaluated by comparing it to the reference sentence given in the dataset. If we compared the words used in these two texts, we would find that they are very different; yet both are valid translations of the dialogue acts.
Many of the currently used automated unsupervised metrics rely on measuring either word overlap or embedding similarity between generated responses and ground-truth responses (i.e., the ones in the dataset). Previous work in non-goal-oriented dialogue (e.g., general chatbots) has shown that these automated metrics do not correlate well with human evaluation. A human would judge the two texts above to be equally valid and equally good, but a word-overlap metric would score them as very different.
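The mismatch is easy to reproduce. The minimal sketch below scores the alternative realization against the dataset reference with sentence-level BLEU using NLTK; this is only an illustration of the problem, not the evaluation code released with our paper.

```python
# Minimal sketch: two equally valid realizations of the same dialogue acts
# get a low word-overlap score. Uses NLTK's BLEU implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ("there are 2 restaurants serving indian food in downtown "
             "toronto . what is your budget ?").split()
candidate = ("i found 2 indian restaurants in downtown toronto . "
             "how much would you like to spend ?").split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # low, even though a human would accept both texts
```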
Unsupervised evaluation of goal-oriented dialogue NLG
Goal-oriented dialogue systems are typically deployed for narrow domains such as restaurant reservation or movie-ticket booking, so there is limited diversity in how a response can be constructed for a given set of dialogue acts. This suggests that unsupervised metrics might correlate better with human evaluation in this setting than in the non-goal-oriented case.
Table: Correlation of automated metrics with human evaluation scores
We evaluate the correlation between human evaluation and automated metrics; for details on the correlation methodology, please refer to our paper. We present our findings in the table above. Among the word-overlap-based automated metrics, we found that, overall, the METEOR score correlates best with human judgments, and we suggest using METEOR instead of BLEU for goal-oriented dialogue natural language generation. We also observe that these metrics are more reliable in the goal-oriented setting than in the general, non-goal-oriented one, owing to the limited diversity possible in the goal-oriented setting. Finally, word-overlap-based metrics correlate better with human evaluation when multiple references are provided, i.e., when each set of dialogue acts comes with several reference sentences to compare against instead of just one.
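For readers unfamiliar with this kind of analysis, the sketch below shows the general shape of a metric-versus-human correlation computation using SciPy. The scores and ratings here are made-up illustrative numbers, not our experimental data, and the choice of Spearman and Pearson correlation is an assumption for this sketch.

```python
# Minimal sketch of a metric-vs-human correlation analysis.
# The numbers below are illustrative only, not results from the paper.
from scipy.stats import spearmanr, pearsonr

metric_scores = [0.42, 0.31, 0.77, 0.55, 0.12, 0.68]  # e.g., per-response METEOR
human_ratings = [3.0, 2.5, 4.5, 4.0, 1.5, 4.0]         # e.g., mean human ratings

rho, rho_p = spearmanr(metric_scores, human_ratings)
r, r_p = pearsonr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.2f} (p = {r_p:.3f})")
```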
In terms of absolute performance, the high automated-metric scores our models achieve on the DSTC2 and Restaurants datasets (two restaurant-domain datasets; DSTC2 provides only one reference sentence per set of dialogue acts, whereas Restaurants provides two) lead us to conclude that these datasets are not very challenging for the NLG task. The goal-oriented dialogue community should move towards larger and more complex datasets, such as the recently announced Frames dataset or the E2E NLG Challenge dataset.
Read the paper >
Source code for evaluating NLG systems
We are releasing the source code for evaluating machine-generated natural language with these metrics. The code computes the pre-existing word-overlap-based and embedding-similarity-based metrics at once, with a single command. It requires a file containing the generated sentences and one or more files containing ground-truth reference sentences, and computes scores on the following metrics, as described in our paper (a usage sketch follows the list):
- BLEU
- METEOR
- ROUGE
- CIDEr
- Skip Thought cosine similarity
- Embedding Average cosine similarity
- Vector Extrema cosine similarity
- Greedy Matching score
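As a rough illustration, calling the released code from Python might look like the sketch below. The package and function names used here (nlgeval, compute_metrics) and the file names are assumptions; please check the repository's README for the exact interface.

```python
# Hypothetical usage sketch -- the package/function names (nlgeval,
# compute_metrics) and file names are assumed, not confirmed by this post.
from nlgeval import compute_metrics

# One file of generated sentences and one or more files of reference
# sentences, aligned line by line.
metrics = compute_metrics(hypothesis="hypotheses.txt",
                          references=["references1.txt", "references2.txt"])
print(metrics)  # e.g., a dict of BLEU, METEOR, ROUGE, CIDEr, and embedding scores
```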