Benefiting from a confluence of factors such as service-oriented architecture, cloud computing, and the Internet of Things (IoT), application programming interfaces (APIs) are playing an increasingly important role in both the virtual and the physical world. For example, web services hosted in the cloud, such as those for weather, sports, and finance, provide data and services to end users via web APIs, and IoT devices expose their functionality to other devices on the network via APIs.
Traditionally, APIs have mainly been consumed by various kinds of software – desktop applications, websites, and mobile apps – that then serve users through graphical user interfaces (GUIs). GUIs have greatly contributed to the popularization of computing, but many of their limitations have surfaced as the computing landscape has evolved. As computing devices become smaller, more mobile, and more intelligent, the screen a GUI requires becomes a burden in many cases, such as wearables and IoT devices. Users must also adapt to a different ad-hoc GUI for each service and device, and as the number of available services and devices rapidly increases, so does the learning and adaptation cost for users. Natural language interfaces (NLIs) show significant promise as a unified, intelligent gateway to a wide range of back-end services and devices. NLIs have enormous potential to capture user intent and contextual information, enabling applications such as virtual assistants to better serve their users.
We have been studying natural language interfaces to APIs (NL2APIs). Unlike general-purpose NLIs such as virtual assistants, we examined how to build NLIs for individual web APIs, for example, the API to a calendar service. Such NL2APIs have the potential to democratize APIs by helping users communicate with software systems. They can also address the scalability issue of general-purpose virtual assistants by allowing for distributed development. The usefulness of a virtual assistant is largely determined by its breadth, that is, the number of services it supports. However, it is tedious for a virtual assistant to integrate web services one by one. If there were a simple way for individual web service providers to build NLIs to their respective APIs, integration costs could be greatly reduced. The virtual assistant would then not need to handle the heterogeneous interfaces to different web services; rather, it would only need to integrate the individual NL2APIs, which enjoy the uniformity of natural language. NL2APIs can also facilitate web service discovery and recommendation, and ease API programming by reducing the burden of memorizing the available web APIs and their syntax.
The core task of NL2API is to map natural language utterances given by users into API calls. More specifically, we focus on web APIs that follow the REST architectural style, that is, RESTful APIs. RESTful APIs are widely used for web services, IoT devices, and smartphone apps. An example from the Microsoft Graph APIs is shown in Figure 1. The left portion of the figure shows the traditional way of building natural language interfaces, where we train language understanding models to map natural language to intents, train other models to extract the slots associated with each intent, and then manually write code to convert intents and slots into API calls. Alternatively, as shown on the right portion of the figure, we can learn to map natural language utterances directly to API calls. In our research, we apply our framework to APIs from the Microsoft Graph API suite, which enables developers to connect to the data that drives productivity – mail, calendar, contacts, documents, directory, devices and more.
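To make the target of this mapping concrete, here is a minimal sketch, in Python, of what an API call for an utterance like "unread emails about PhD study" might look like. The frame-style dictionary and its OData-style serialization are illustrative assumptions, not the exact internal representation used in our system.

```python
from urllib.parse import urlencode

# Hedged sketch: a structured target that the utterance
# "unread emails about PhD study" could be mapped to. The parameter
# names mirror those discussed below.
api_call = {
    "api": "GET-Messages",
    "parameters": {
        "FILTER(isRead)": "False",
        "SEARCH": "PhD study",
    },
}

# One possible serialization into an OData-style query against the
# Microsoft Graph messages endpoint. Whether the live endpoint accepts
# this exact combination of options is not verified here; the point is
# the structured output the model must predict.
query = urlencode({"$filter": "isRead eq false", "$search": '"PhD study"'})
print(f"GET https://graph.microsoft.com/v1.0/me/messages?{query}")
```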
One of the requirements in developing our model is the ability to support fine-grained user interaction. Most existing NLIs provide users with little help when their commands are interpreted incorrectly. We hypothesize that support for fine-grained user interaction can greatly improve the usability of NLIs.
We developed a modular sequence-to-sequence model (see Figure 3) to enable fine-grained interaction in NLIs. We retain the sequence-to-sequence architecture but decompose the decoder into multiple interpretable components called modules. Each module specializes in predicting a pre-defined kind of output, for example, instantiating a specific parameter of an API call by reading the input utterance. After some simple mapping, users can easily understand the prediction of any module and interact with the system at the module level. Each module in our model generates a sequential output rather than a continuous state.
Modules: We first define modules. A module is a specialized neural network designed to fulfill a specific sequence prediction task. In NL2API, different modules correspond to different parameters. For example, for the GET-Messages API the modules are FILTER(sender), FILTER(isRead), SELECT(attachments), ORDERBY(receivedDateTime), SEARCH, and so on. The task of a module, if triggered, is to read the input utterance and instantiate a full parameter. To do that, a module needs to determine its parameter values based on the input utterance. For example, given the input utterance "unread emails about PhD study," the SEARCH module needs to predict that the value of the SEARCH parameter is "PhD study," and generate the full parameter, "SEARCH PhD study," as its output sequence. Similarly, the FILTER(isRead) module needs to learn that phrases such as "unread emails," "emails that have not been read," and "emails not read yet" all indicate that its parameter value is False. It is natural to implement the modules as attentive decoders, as in the original sequence-to-sequence model. However, instead of a single decoder for everything, we now have multiple decoders, each specialized in predicting a specific parameter. Moreover, because each module has clearly defined semantics, it becomes straightforward to enable user interaction at the module level.
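As a concrete sketch of what one such module could look like, the PyTorch code below implements a module as a small attentive GRU decoder plus a greedy decoding loop; one instance would be created per parameter, all sharing the same utterance encoder. The class and function names, the dot-product attention, and the decoder-state initialization are illustrative assumptions, not the exact architecture or hyperparameters from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterModule(nn.Module):
    """Sketch of one module, e.g. SEARCH or FILTER(isRead): an attentive
    GRU decoder that reads the encoded utterance and emits the tokens of
    one full parameter."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRUCell(2 * hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, encoder_states, state, prev_token):
        # encoder_states: (src_len, hidden); state: (hidden,); prev_token: 0-d long tensor
        scores = encoder_states @ state                        # dot-product attention
        attn = F.softmax(scores, dim=0)                        # weights over source tokens
        context = (attn.unsqueeze(1) * encoder_states).sum(0)  # attended summary of utterance
        inp = torch.cat([self.embed(prev_token), context], dim=-1)
        state = self.gru(inp.unsqueeze(0), state.unsqueeze(0)).squeeze(0)
        return self.out(state), state                          # logits over output vocab, new state


def greedy_decode(module, encoder_states, bos_id, eos_id, max_len=10):
    """Greedily decode one parameter, e.g. the token sequence "SEARCH PhD study"."""
    state = encoder_states.mean(dim=0)   # simple state initialization; an assumption
    token = torch.tensor(bos_id)
    output = []
    for _ in range(max_len):
        logits, state = module.step(encoder_states, state, token)
        token = logits.argmax()
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output
```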
Controller: For any input utterance, only a few modules will be triggered, and it is the job of the controller to determine which ones. The controller is itself implemented as an attentive decoder. Taking the encoding of the utterance as input, it generates a sequence of modules, called the layout. The triggered modules then generate their respective parameters, and finally the parameters are composed to form the final API call.
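Putting the pieces together, the sketch below (reusing ParameterModule and greedy_decode from above) shows how a controller-predicted layout could drive the triggered modules and compose their outputs into an API call. The predict_layout helper, the vocab object, and the composition format are hypothetical stand-ins for the actual implementation.

```python
def interpret(utterance_tokens, encoder, controller, modules, vocab):
    """End-to-end sketch: encode the utterance, let the controller predict
    a layout (which modules to trigger), instantiate each parameter, and
    compose the final API call."""
    encoder_states = encoder(utterance_tokens)            # (src_len, hidden)

    # The controller is itself an attentive decoder over module names;
    # here it is abstracted behind a hypothetical predict_layout() helper.
    layout = controller.predict_layout(encoder_states)    # e.g. ["FILTER(isRead)", "SEARCH"]

    parameters = {}
    for name in layout:                                    # run only the triggered modules
        token_ids = greedy_decode(modules[name], encoder_states,
                                  vocab.bos_id, vocab.eos_id)
        parameters[name] = " ".join(vocab.id2word[i] for i in token_ids)

    # e.g. {"api": "GET-Messages",
    #       "parameters": {"FILTER(isRead)": "False", "SEARCH": "PhD study"}}
    return {"api": "GET-Messages", "parameters": parameters}
```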
By decomposing the complex prediction process of a typical sequence-to-sequence model into small, highly specialized prediction units called modules, we can easily explain the model's prediction to users and solicit their feedback to correct possible prediction errors at a granular level. In our research, we test our hypothesis by comparing an interactive NLI with its non-interactive counterpart through both simulation and human-subject experiments on real-world APIs. We show that with interactive NLIs, users achieve a higher success rate and a lower task completion time, which leads to greatly improved user satisfaction.
For more details, we encourage you to read our paper, "Natural Language Interfaces with Fine-Grained User Interaction: A Case Study on Web APIs," to be presented at SIGIR 2018 in Ann Arbor, Michigan.