CLASSiC: D6.4: Final evaluation of classic towninfo and appointment scheduling systems

  • Romain Laroche,
  • Ghislain Putois,
  • Philippe Bretier,
  • Martin Aranguren,
  • Julia Velkovska,
  • Helen Hastie,
  • Simon Keizer,
  • Kai Yu,
  • Filip Jurcicek,
  • Oliver Lemon,
  • Steve Young



This document is a report on the final evaluations of the CLASSiC TownInfo and Appointment Scheduling systems. It describes the setup and results of the experiments involving real users calling 4 different systems to perform different tasks and give ratings to each dialogue. For both TownInfo and Appointment Scheduling (AS) domains, one of the evaluated systems incorporated several components from different sites within the consortium. For more details about these integrated systems, see D5.2.2 for the CLASSiC TownInfo systems, and D5.4 for the CLASSiC Appointment Scheduling systems.

For the TownInfo evaluations a total of 2046 dialogues were collected. For the AS systems, System 2 collected a total of 628 dialogues, while Systems 3 and 4 collected 740 and 709 dialogues for evaluation respectively, for a total of 2077 AS dialogues.

The main contrasts explored in the TownInfo evaluations were the effects of processing N-best lists as input to the dialogue system (using POMDP techniques) as opposed to using only 1-best ASR input, and the effects of using the trained NLG components.
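To make this contrast concrete, the sketch below illustrates, in deliberately simplified form, how a confidence-scored N-best list can be folded into a belief distribution over user goals, whereas a 1-best system commits entirely to the top hypothesis. This is not the CLASSiC dialogue manager's code; all function and variable names are illustrative.

```python
# Simplified sketch contrasting 1-best and N-best ASR input handling.
# All names are illustrative; this is not the CLASSiC dialogue manager's API.

def update_belief(belief, nbest, floor=0.01):
    """Fold a confidence-scored N-best list into a belief over user goals
    (simplified Bayesian update; goals absent from the list get a small floor)."""
    obs = dict(nbest)
    new_belief = {goal: p * obs.get(goal, floor) for goal, p in belief.items()}
    total = sum(new_belief.values()) or 1.0
    return {goal: p / total for goal, p in new_belief.items()}

def one_best_update(belief, nbest):
    """1-best baseline: commit entirely to the top ASR hypothesis."""
    top_hyp, _ = max(nbest, key=lambda item: item[1])
    return {goal: (1.0 if goal == top_hyp else 0.0) for goal in belief}

# Example: two candidate goals and a noisy, closely ranked N-best list.
belief = {"cheap_hotel": 0.5, "cheap_bar": 0.5}
nbest = [("cheap_bar", 0.55), ("cheap_hotel", 0.45)]
print(update_belief(belief, nbest))    # keeps probability mass on both hypotheses
print(one_best_update(belief, nbest))  # discards the runner-up entirely
```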

The AS evaluation explores the differences between several systems: the ‘academic’ system, with and without a trained NLG component (System 2); the FT commercial system, adapted to the experimental set-up (System 3); and the FT lab system (System 4), an evolution of the FT commercial system that uses open questions which do not constrain the user to a predefined behaviour and that also incorporates uncertainty management.

Part I of the report concerns the TownInfo system (System 1) and Part II concerns the Appointment Scheduling systems (Systems 2, 3, and 4). This report also presents the sociological evaluation of the Appointment Scheduling systems carried out by France Telecom / Orange Labs (Part II, Chapter 5).

Results from the TownInfo trial were mixed. Four main measures were applied: subjective success rate (PercSucc), objective partial completion based on the assigned goals (ObjSucc-AG-PC), objective full completion based on the assigned goals (ObjSucc-AG-FC), and objective full completion based on the inferred goals (ObjSucc-IG). Partial completion requires only that subjects found an appropriate venue, whereas full completion requires that they also obtained all of the required ancillary information, such as phone number and address. The inferred goals (IG) measure attempted to match the system’s responses to what the user actually asked for, rather than the assigned goals.
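As an illustration only, the following sketch shows how these four measures could be computed over logged dialogues. The dialogue record fields are hypothetical and do not reflect the actual evaluation log schema.

```python
# Illustrative computation of the four TownInfo measures over logged dialogues.
# The dialogue record fields below are hypothetical, not the actual log schema.

def towninfo_scores(dialogues):
    n = len(dialogues)
    # Subjective success: the user's own rating of whether the task succeeded.
    perc_succ = sum(d["user_rated_success"] for d in dialogues) / n
    # Partial completion: an appropriate venue (w.r.t. the assigned goal) was offered.
    obj_ag_pc = sum(d["venue_matches_assigned_goal"] for d in dialogues) / n
    # Full completion: the venue matched AND all required ancillary
    # information (e.g. phone number, address) was obtained.
    obj_ag_fc = sum(d["venue_matches_assigned_goal"] and d["got_all_ancillary_info"]
                    for d in dialogues) / n
    # Inferred-goal completion: scored against what the user actually asked for.
    obj_ig = sum(d["venue_matches_inferred_goal"] and d["got_all_ancillary_info"]
                 for d in dialogues) / n
    return {"PercSucc": perc_succ, "ObjSucc-AG-PC": obj_ag_pc,
            "ObjSucc-AG-FC": obj_ag_fc, "ObjSucc-IG": obj_ig}
```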

On partial completion, the CLASSiC system with the specialised NLG component was significantly better than the other systems. On the remaining measures, the systems were broadly similar in performance. A striking feature of all the results was that the objective measures were all much lower than the subjective success rates (PercSucc). This is thought to be mostly because users were often unaware that the venue offered did not actually satisfy their goals or that they had failed to ask for certain required information. This illustrates one of the major shortcomings of this type of trial.
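As an illustration of the kind of test that underlies such "significantly better" comparisons (the exact test used in the trial is not restated here), a two-proportion z-test over completion counts could be computed as follows; the counts in the example are invented.

```python
# Hedged sketch of a two-proportion z-test for comparing completion rates
# between two systems. The counts in the example are made up.
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Example: 70% vs 62% completion over 500 dialogues each -> z ~ 2.67, p ~ 0.008.
print(two_proportion_z(350, 500, 310, 500))
```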

One surprising feature of the TownInfo trial results was that, in contrast to the simulation results, in several cases the N-best system did not perform better than the 1-best system. Evaluation of semantic accuracy indicated that there was additional information in the N-best lists from the recogniser, but clearly the dialogue manager failed to exploit it. The most likely reason for this is that the error model used in the user simulator is a poor match to the actual pattern of errors incurred in the real data. This reinforces the need to move away from training on simulators and instead train on real user data.

A major performance issue in the TownInfo trial arose from the lack of appropriate training data. As a result, the systems used in the main Feb’11 trial had word error rates ranging from 53% to 56%. However, even with these very high WERs, perceived success rates of 60% to 65% were achieved in the Feb’11 trial. This shows that the systems were fairly resilient even when operating in extremely hostile conditions. Following the trial, the collected data were used to retrain the recogniser, halving the error rate to a WER of 26%. A further trial was then conducted after the project officially ended, and the perceived success rate increased to 88%, showing the impact of the poorly trained recognition models.
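For reference, word error rate is the number of word-level substitutions, insertions, and deletions needed to turn the recogniser output into the reference transcript, divided by the reference length. A minimal sketch of this standard computation (not the scoring tool used in the trial) is given below.

```python
# Minimal word error rate (WER) computation via Levenshtein alignment.
# This is the standard definition behind the 53-56% and 26% figures above,
# not the actual scoring tool used in the trial.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("book a table for two", "book table for you"))  # 2 errors / 5 words = 0.4
```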

Three different systems for Appointment Scheduling (AS) were also evaluated (Systems 2, 3, and 4), using over 2000 dialogues. System 3 is a variant of the deployed France Telecom 1013+ service, and System 4 is a more advanced laboratory version of this system. System 2 was built using the statistical components developed by the academic partners in the project. Although comparing Systems 2, 3 and 4 directly is not possible due to the different speech recognition components used, we can draw some general conclusions about the comparative performance of the different systems.

While commercial systems are typically deployed only after many iterations of user testing, in this case both System 2 and System 4 were trialled following minimal testing, and achieved performance comparable to System 3 (all performing at around 80% task completion). System 3 was already the result of on-line optimisation, which had yielded a 10% increase in task completion. This means that the performance of Systems 2 and 4 already exceeds that of the classical handcrafted system. In addition, these systems were developed rapidly using the methods and tools developed during the CLASSiC project.

Regarding the trained NLG component, the version of System 2 which included the trained component for Temporal Referring Expression generation showed a statistically significant improvement in Perceived Task Success (+23.7%) and a reduction in call time of 15.7% (to appear, [20]).

The issue of how much freedom it is beneficial to give the user (i.e. user- versus system-initiative) is also explored in detail in Section 4.4.

Chapter 5 also presents further detailed qualitative analysis of the AS dialogues using methods from Conversation Analysis, for example examining types of errors and interactional misalignment phenomena between the user and the system. This leads to suggestions of strategies for error recovery.

Taken together, this set of results shows that the statistical learning methods and tools developed in the CLASSiC project provide a promising foundation for future research and development into robust and adaptive spoken dialogue systems.