Automatic Learning and Evaluation of User Centered Objective Functions for Dialogue System Optimisation


User:   "I am searching for a song by Radiohead."
System: "Searching for music by Radiohead. Which album?"        [info acquisition]
User:   "Kid A."
System: A list of songs is shown on the screen.                 [info presentation]
User:   [selects an item]
System: "You will now hear Optimistic by Radiohead. Are you happy with this option?"
User:   "Yes."                                                  [music]

Figure 1: Hierarchical dialogue structure for information-seeking multimodal systems.

2.2 Method

In the following, the overall method is briefly summarised; please see Rieser and Lemon (2008b) and Rieser (2008) for details.

1. We obtain an objective function from the WOZ data of Rieser et al. (2005) according to the PARADISE framework. In PARADISE, multivariate linear regression is applied to experimental dialogue data in order to develop predictive models of user preferences (obtained from questionnaires) as a linear weighted function of dialogue performance measures, such as dialogue length. This predictive model is used to automatically evaluate dialogues. For RL, this function is used as the reward for training.

2. We train an RL-based dialogue system with the obtained model. The hypothesis is that by using the obtained quality measures as a reward function for RL, we will be able to learn an improved strategy over a policy which simply mimics observed patterns, i.e. the human wizard behaviour in the data. The baseline policy is therefore constructed using Supervised Learning (SL) on the WOZ data. We then test both strategies (SL and RL) with real users, using the same objective evaluation function.

3. Since the objective function plays such a central role in automatic dialogue design, we need to find methods that ensure its quality. In this paper we evaluate the obtained function in a test-retest comparison between the model obtained from the WOZ study and the one obtained when testing the real system, as described in the following.

We choose Task Ease as the ultimate measure to be optimised, following Clark's (1996) principle of least effort, which says: "All things being equal, agents try to minimize their effort in doing what they intend to do."

The PARADISE regression model is constructed from 3 different corpora: the SAMMIE WOZ experiment (Rieser et al., 2005) and the iTalk system used for the user tests (Rieser and Lemon, 2008b), running the supervised baseline policy and the RL-based policy. By replicating the regression model on different data sets, we test whether the automatic estimate of Task Ease generalises beyond the conditions and assumptions of a particular experimental design. The resulting models are shown in Equations 1-3, where TaskEase_WOZ is the regression model obtained from the WOZ data, TaskEase_SL is obtained from the user test data running the supervised policy, and TaskEase_RL is obtained from the user test data running the RL-based policy. They all reflect the same trends: longer dialogues (measured in turns) predict a lower Task Ease, whereas good performance in the multimodal information presentation phase (multimodal score) positively influences Task Ease. For the iTalk user tests, almost all the tasks were completed; therefore task completion was only chosen as a predictive factor for the WOZ model.

  TaskEase_WOZ = 1.58 + .12 * taskCompl + .09 * mmScore - .20 * dialogueLength   (1)
  TaskEase_SL  = 3.50 + .54 * mmScore - .34 * dialogueLength                     (2)
  TaskEase_RL  = 3.80 + .49 * mmScore - .36 * dialogueLength                     (3)

3 Model Stability

For the information acquisition phase, we applied stepwise multivariate linear regression to select the dialogue features which are most predictive for perceived Task Ease. Task Ease is a measure from the user questionnaires, obtained by taking the average of two user ratings on a 5-point Likert scale:

1. "The task was easy to solve."
2. "I had no problems finding the information I wanted."

To evaluate the obtained regression models we use two measures: how well they fit the data (goodness of fit), and how close the functions are to each other (model replicability). For the WOZ model the data fit was rather low (R²_WOZ = .03), whereas for the models obtained from the iTalk system the fit improved (R²_RL = .48 and R²_SL = .55); for R² we use the adjusted values. To directly compare the functions, we plotted them in 3D space (the 4th dimension for TaskEase_WOZ was omitted); see Figure 2. While the models obtained with the iTalk system show almost perfect overlap (R² = .98), the reduced WOZ model differs (R² = .22) in the sense that it assigns less weight to dialogue length and the multimodal presentation score.
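The regression step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the feature names mirror Equation 1, but the dialogue data and ratings are invented, and plain least squares stands in for the stepwise feature selection.

```python
import numpy as np

# Synthetic dialogue data (invented for illustration): each row is one dialogue
# with [taskCompl, mmScore, dialogueLength]; y holds questionnaire Task Ease ratings.
X = np.array([
    [1.0, 4.0,  8.0],
    [1.0, 3.0, 12.0],
    [0.0, 2.0, 20.0],
    [1.0, 5.0,  6.0],
    [0.0, 1.0, 25.0],
    [1.0, 4.0, 10.0],
])
y = np.array([4.5, 3.5, 1.5, 5.0, 1.0, 4.0])

# PARADISE-style model: TaskEase = intercept + w . features,
# fitted by ordinary least squares (cf. Equation 1).
A = np.hstack([np.ones((len(X), 1)), X])   # prepend an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ w
print("weights (intercept, taskCompl, mmScore, dialogueLength):", np.round(w, 2))
print("predicted Task Ease:", np.round(predicted, 2))
```

The fitted weight vector plays the role of the coefficients in Equations 1-3; once estimated, the model can score unseen dialogues and thus serve as an automatic evaluation function or RL reward.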
4 Model Performance (Prediction Accuracy)

[Figure 2: 3D visualisation of the objective functions obtained from WOZ data and real user data using an SL- and RL-based strategy.]

We now investigate how well these models generalise by testing their prediction accuracy. Previous research evaluated two aspects: how well a given objective function is able to predict unseen events from the original system (Engelbrecht and Möller, 2007), and how well it is able to predict unseen events of a new, different system (Walker et al., 2000). We evaluate these two aspects as well; the only difference is that we use the Root Mean Squared Error (RMSE) instead of R² for measuring the models' prediction accuracy. RMSE is, as we argue, more robust for small data sets. In particular, we argue that by correcting for variance, R² can lead to artificially good results when using small test sets, which typically vary more, and is sensitive to outliers (see Equation 4). RMSE instead measures the root mean difference between actual and predicted values (see Equation 5).

  R² = 1 - [ sum_i (y_i - yhat_i)² ] / [ sum_i (y_i - ybar)² ]   (4)

  RMSE = sqrt( (1/n) * sum_{i=1..n} (y_i - yhat_i)² )            (5)

where y_i are the observed ratings, yhat_i the predicted ratings, and ybar the mean observed rating.

First, we measure the predictive power of our models within the same data set, using 10-fold cross-validation, and across the different systems, by testing models trained on one system to predict perceived Task Ease for another system, following a method introduced by Walker et al. (2000). The results for comparing the RMSE (max 7 for SL/RL and max 5 for WOZ) for training and testing within data sets (ID 1-3) and across data sets (ID 4-5) are shown in Table 1. In order to present results from different scales, we also report the RMSE as a percentage of the maximum error (% error). The results show that predictions according to PARADISE can lead to accurate test results despite the low data fit: while for the regression model obtained from the WOZ data the fit was 10 times lower than for SL/RL, the prediction performance is comparably good (see Table 1, ID 1-3). The models also generalise well across systems (see Table 1, ID 4-5).

  ID | train        | test         | RMSE | % error
  1  | WOZ (SAMMIE) | WOZ (SAMMIE) | 0.82 | 16.42
  2  | SL (iTalk)   | SL (iTalk)   | 1.27 | 18.14
  3  | RL (iTalk)   | RL (iTalk)   | 1.06 | 15.14
  4  | RL (iTalk)   | SL (iTalk)   | 1.23 | 17.57
  5  | SL (iTalk)   | RL (iTalk)   | 1.03 | 14.71

Table 1: Prediction accuracy for models within (1-3) and across (4-5) data sets.

In addition, we evaluate model accuracy following a method introduced by Engelbrecht and Möller (2007). They suggest comparing model performance by plotting mean values for predicted and true ratings, averaging over conditions. We replicate this method, averaging mean ratings for observed and predicted Task Ease over number of turns. The resulting graphs in Table 2 show that the predicted mean values per turn are fairly accurate for the SL and RL objective functions (first two graphs from the left). For the WOZ data the predictions are less accurate, especially for low numbers of turns (graph on the right). This is due to the fact that for low numbers of turns only very few observations are in the training set: 25% of the dialogues are between 5 and 6 turns long, where the predictions are close to the observations, and 42% of the dialogues are over 14 turns long, where the curves converge again. Only 33% cover the span between 7-13 turns, where the graphical comparison indicates low prediction performance. However, these results are misleading for small data sets, as we argue. Quite the contrary is the case: the predicted values show that the linear model does well for the majority of the cases and is not sensitive to outliers, i.e. the graph only diverges if there are too few observations. It therefore generalises well.

[Table 2: Average Task Ease ratings for dialogues of different length (in turns); the solid lines are the true ratings and the dashed lines the predicted values.]

5 Error Analysis

In previous work we showed that the RL-based policy significantly outperforms the supervised policy in terms of improved user ratings and dialogue performance measures (Rieser and Lemon, 2008b). Here we test the relationship between improved user ratings and dialogue behaviour, i.e. we investigate which factors lead the users to give higher scores, and whether this was correctly reflected in the original reward function. We concentrate on the information presentation phase, since there is a simple two-way relationship between user scores and the number of presented items. To estimate this relationship we use curve fitting, which serves as an alternative to linear regression in cases where the relationship between two variables can also be non-linear. For each presentation mode (verbal vs. multimodal), we select the simplest model with the closest fit to the data (R²).

5.1 Training

We first use this method to construct the reward function for policy learning from the WOZ data. Figure 3 shows the employed reward function for information presentation, modelled from the WOZ data.
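The curve-fitting step itself can be sketched as follows. This is a minimal illustration, not the authors' implementation: the item counts and user scores are invented, and the comparison of a linear against a quadratic model by residual sum of squares stands in for selecting the simplest curve with the closest fit.

```python
import numpy as np

# Toy user scores per number of multimodally presented items (invented):
# scores rise towards a mid-range optimum and fall off for very long lists.
items  = np.array([2, 5, 8, 11, 14, 17, 20, 25, 30], dtype=float)
scores = np.array([3.0, 4.2, 5.1, 5.8, 6.1, 6.0, 5.5, 4.3, 2.8])

# Fit a linear and a quadratic model and compare their fit
# via the residual sum of squares (RSS).
lin  = np.polyfit(items, scores, 1)
quad = np.polyfit(items, scores, 2)
rss  = {deg: float(np.sum((scores - np.polyval(c, items)) ** 2))
        for deg, c in ((1, lin), (2, quad))}

# The turning point of the quadratic a*x^2 + b*x + c lies at x = -b / (2a);
# this is the analogue of the 14.8-item turning point in the paper.
a, b, _ = quad
turning_point = -b / (2 * a)
print("RSS linear vs quadratic:", round(rss[1], 2), round(rss[2], 2))
print("estimated turning point:", round(turning_point, 1), "items")
```

A non-linear reward of this shape lets the learner trade off showing more options against overwhelming the user, which a purely linear objective cannot express.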
[Figure 3: WOZ objective function for the information presentation phase. User score is plotted against the number of presented items: the multimodal reward MM(x) is a quadratic curve with a turning point at 14.8 items, the verbal reward Speech(x) a declining line; the two intersect near 2.62 items.]

The straight line presents the objective function for verbal presentation, and the quadratic curve the one for multimodal presentation.

In the WOZ experiments, wizards never presented more than 3 items using speech, resulting in a linearly decreasing line. This fact was captured by the learning schemes in different ways: SL extracted the rule "never present more than 3 items using speech". For RL, the extrapolated line assigns negative values to more than 5 verbally presented items and intersects with the multimodal reward at 2.62, i.e. for more than 3 items the returned reward is higher when presenting multimodally. Therefore, the RL-based strategy learns to present up to 3 items verbally (on average not more than 2.4 items per dialogue).

5.2 Testing

We now apply the same curve-fitting method to the iTalk user test data in order to test whether the policy optimisation had been successful. We therefore compare the curve-fitting model obtained from the system running the RL policy against the model obtained from the SL policy. The hypothesis is that if the policy is good, i.e. consistently making the right decisions, this will result in equally high scores for all presented items, represented by a straight line; whereas if the curve is not linear, this indicates that the policy was sometimes making the right decision and sometimes not.

The estimated relationship between the average number of items presented verbally and the verbal presentation score from the user questionnaire is shown in the left column of Table 3. The straight, slightly declining line indicates that the policies in general make the right decision, although the fewer items they present, the better. For verbal presentation, both learning schemes (RL and SL) were able to learn a policy from the WOZ data which received consistently good ratings from the users (between 6.5 for RL and 5.4 for SL on a 7-point Likert scale).

For multimodal presentation, the WOZ objective function has a turning point at 14.8 (see Figure 3). The RL-based policy learned to maximise the returned reward by displaying no more than 15 items. The SL policy, in contrast, did not learn an upper boundary for when to show items on the screen, since the wizards did not follow a specific pattern (Rieser and Lemon, 2008b). When relating number of items to user scores, the RL policy produces a linear, slightly declining line between 7 and 6 (Table 3, bottom right), indicating that the applied policy reflected the users' preferences. Hence we conclude that the objective function derived from the WOZ data gave the right feedback to the learner. For the SL policy, the logarithmic function best describes the data. It indicates that the multimodal presentation strategy received the highest scores if the number of items presented was just under 15 (Table 3, top right), which is the turning point of the WOZ objective function. This again indicates that for the iTalk users the preferred multimodal policy was indeed the one reflected in the WOZ objective function.

[Table 3: Objective functions for information presentation, per corpus, for verbal and multimodal presentation.]

6 Conclusion

This paper introduces data-driven methods for obtaining reliable objective functions for dialogue system design, and so steers dialogue design towards science rather than art. We applied data-driven methods to build objective functions for both dialogue policy learning and evaluation, reflecting the needs of real users. In particular, we derived a non-linear objective function from Wizard-of-Oz data, which is used to automatically train a Reinforcement Learning based dialogue strategy, which was then evaluated [...] incrementally training a system according to improved representations of real user preferences, for example gathered online from a deployed spoken dialogue system. This work also introduces non-linear objective functions for dialogue optimisation, which merit further exploration [...]

Automatic Learning and Evaluation of User Centered Objective Functions for Dialogue System Optimisation
Verena Rieser and Oliver Lemon
School of Informatics, University of Edinburgh, UK
{vrieser, olemon}@inf.ed.ac.uk

Abstract: The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem. The applied [...]
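As a closing illustration, RMSE and %-of-maximum-error figures of the kind reported in Table 1 can be computed as follows. The ratings below are invented; only the formulas (Equations 4 and 5) come from the paper, and the adjusted-R² correction follows the footnoted convention.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error (Equation 5)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred, n_features=0):
    """Coefficient of determination (Equation 4); optionally adjusted
    for the number of predictors, as in the paper's footnote."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(y_true)
    if n_features:  # adjusted R^2
        r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return float(r2)

# Toy observed vs predicted Task Ease ratings on a 7-point scale (invented).
observed  = np.array([6.0, 5.0, 3.0, 7.0, 4.0, 2.0])
predicted = np.array([5.5, 5.2, 3.8, 6.4, 4.1, 2.9])

err = rmse(observed, predicted)
print("RMSE:", round(err, 2))
print("% of max error (scale max 7):", round(100 * err / 7, 2))
print("R^2:", round(r_squared(observed, predicted), 2))
```

Reporting the RMSE as a percentage of the maximum possible error, as in the % error column, makes models trained on scales with different ranges directly comparable.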
