Abstract: Combining deep generation with statistical surface generation benefits from response utterances readily available in a corpus. The representation and quality of the instance data are the foremost factors affecting the classification accuracy of statistical methods; any irrelevant or unreliable tagging of the response classes therefore results in low accuracy. This study focuses on improving dialogue act classification of a user utterance into a response class by clustering the semantic and pragmatic features extracted from each user utterance. A Decision tree is used to classify a corpus of 64 mixed-initiative, transaction dialogues in the theater domain. The experiment shows that by using a clustering technique in the pre-processing stage to re-tag the response classes, the Decision tree achieves 97.5% recognition accuracy, better than the 81.95% recognition accuracy obtained when using the Decision tree alone.
INTRODUCTION
A dialogue system is a computer system that communicates with human users via natural language in a coherent structure to achieve certain communicative goals. Dialogue systems deal with interaction management issues such as turn-taking and topic management. One of the main concerns in such systems is the coherency of the response utterances, where attention is focused on the words that provide useful information for extracting the meaning of the utterance. This means that, to generate a potential response in the dialogue, the system must be able to recognize this meaning while maintaining equal semantic content (Keizer and Bunt, 2007; Castro et al., 2004).
In order to understand the role that an utterance plays in the dialogue (e.g., a question for information or a request to perform an action), dialogue systems need to perform dialogue act classification and generate an appropriate turn. In recent years, several empirical techniques have been used to train dialogue act classifiers (Reithinger and Maier, 1995; Stolcke et al., 2000; Walker et al., 2001).
Ali et al. (2009) proposed the use of a Decision tree to improve dialogue act classification, classifying a user utterance into a response class within a classification-and-ranking architecture. Their results show that the Decision tree classifier is well suited to the classification task, achieving better accuracy than the previous techniques, namely Bayesian networks and Maximum likelihood.
Yang et al. (2008) showed that dialogue act tagging and prosodic information can help improve the identification of action item descriptions and agreements, by training a Decision tree classifier on prosodic features. In addition, the confidence scores of action motivators and prosody can be extracted automatically, without human tagging cost.
Decision trees are able to learn from conversations and to optimize dialogue management, minimizing the number of turn-taking steps in the dialogue. The Decision tree represents the chronological ordering of actions via the parent-child relationship and uses an object frame to represent the information state. The findings show that a dialogue management mechanism that supports decision processes based on Decision trees can be successfully applied in dialogue applications such as contact center solutions (Fodor, 2007).
Olguin and Cortes (2006) presented a methodology that promises a simple way to identify dialogue act types for the construction of dialogue managers. The proposed methodology trains CART-style decision trees on corpus data, where the predictors are utterance duration and sentence mood and the target is the dialogue act type. First, sentence mood is predicted from INTSINT intonation tagging. The utility of predicting sentence mood was demonstrated by comparing trees built with tagged sentence mood, predicted sentence mood and no sentence mood at all. The resulting decision trees can be represented as if-then rule sets, which can be programmed into a dialogue management system to identify the dialogue act type of an unknown utterance.
Komatani et al. (2005) proposed an abstract structure for a database search task and modeled it in two modes: specifying query conditions and requesting detailed information. A set of very simple dialogue acts is then defined corresponding to this dialogue model. Furthermore, they maintain query conditions as a tree structure, which can be used to weight the attributes of query conditions. The constraints derived from these models are integrated using Decision tree learning, so that the system can determine the dialogue act of an utterance and decide whether each content word should be accepted or rejected, even when the utterance contains Automatic Speech Recognition (ASR) errors.
In classification-and-ranking, the decision to choose one response utterance over another requires a considerable amount of domain knowledge. Hence, a knowledge-based approach, as in deep generation, is absolutely necessary. Deep generation determines the content of an utterance, or what to say, while surface generation realizes the structure of the utterance, or how to say it. Because deep generation requires a high degree of linguistic abstraction to produce the fine-grained input specifications that drive surface generators (Varges and Purver, 2006; Langkilde-Geary, 2002; Belz, 2007), its primary drawback is the classic knowledge engineering bottleneck.
Overgeneration-and-ranking approaches to natural language generation have become increasingly popular (Paiva and Evans, 2005; Oh and Rudnicky, 2000). Overgeneration-and-ranking in dialogue processing performs mild overgeneration of candidates, followed by ranking to select the highest-ranked candidate as output (Varges and Purver, 2006). The main problem with this approach is that it has to generate many candidate sentences. In addition, language models such as n-grams have a built-in bias towards shorter strings: the likelihood of a string of words is calculated as the joint probability of the words, more precisely, the product of the probabilities of each word given its n-1 preceding words (Belz, 2007). Such a bias is undesirable for the generation of dialogue utterances, because all candidates must be treated equally, regardless of length and language rules.
In this study, we propose a response classification experiment based on user intentions using a Decision tree. Intention-based response generation systems require the task of classifying response utterances into response classes; a response class contains all response utterances that are coherent with a particular input utterance. We use the K-means clustering method to re-tag the response classes, reducing irrelevant or unreliable tagging. Classification-based NLG has previously been applied to deep generation tasks, to guide the process of surface generation (Marciniak and Strube, 2004). However, as noted by Stent (2002), former classification-based experiments do not take a fully stochastic approach to response generation, but apply it only in deep generation.
DIALOGUE CORPUS
The SCHISMA (SCHouwburg Informatie Systeem) dialogue corpus is a collection of 64 text-based dialogues from a theater information and ticket reservation system. Users can make inquiries about scheduled theater performances and, if they wish, make ticket reservations. Figure 1 shows an extract of the dialogue data in SCHISMA. The corpus was obtained through a series of Wizard of Oz experiments, built purposely for the acquisition of a dialogue corpus in the theater domain. The corpus contains 920 user utterances and 1127 server utterances in total. The SCHISMA corpus is mixed-initiative (Hulstijn and van Hessen, 1998).
There are two types of interaction: inquiry and transaction. During inquiry the user has the initiative and the system answers the user's questions. When the user indicates that he or she wants a reservation transaction, the system takes the initiative: it asks the user a series of questions, such as the number of tickets to reserve and any discount cards, and the user answers them to complete the reservation details required by the system.
Fig. 1: | SCHISMA dialogue excerpt |
Table 1: | FLF and BLF for SCHISMA |
In a transaction dialogue, before the stage of booking is reached, the user and the system must cooperate to reach agreement on several issues, such as the price of the ticket, the seating arrangement or the availability of a discount. This model is more complex than question-answering systems because, at any time, either party may request information from the other; the user, in particular, might back out of any previous decision and start talking about a totally different direction (Traum, 1997).
The SCHISMA corpus is tagged using a dialogue act annotation scheme based on the Dialogue Act Markup in Several Layers (DAMSL) framework by Keizer and den Akker (2007). Table 1 shows the dialogue acts, represented as FLFs and BLFs, in the SCHISMA corpus. SCHISMA-DAMSL consists of five layers, each covering a different aspect of communicative function. This study is concerned with two layers only: the Forward-Looking Functions (FLF) and Backward-Looking Functions (BLF). Both indicate the communicative functions of an utterance. FLF tags indicate the type of speech act that the utterance conveys, for example assert, info-request and commit. BLF tags indicate how the utterance relates to the previous utterance and include answers (positive, negative or no-feedback) to questions and degrees of understanding or disagreement.
K-MEANS CLUSTERING FOR PRE-PROCESSING
Clustering is an important tool for a variety of applications in data mining, statistical data analysis, data compression and vector quantization. The goal of clustering is to group data into clusters such that the similarities among data members within the same cluster are maximal while similarities among data members from different clusters are minimal (Deelers and Auwatanamongkol, 2007). K-means is a well-known prototype-based, partitioning clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids. The K-means algorithm is as follows:
• | Select initial centers of the K clusters. Repeat steps 2 through 3 until the cluster membership stabilizes |
• | Generate a new partition by assigning each data point to its closest cluster center |
• | Compute new cluster centers as the centroids of the clusters |
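The three steps above can be sketched as follows. This is a minimal illustrative implementation on numeric feature vectors; it assumes the utterance features have already been encoded numerically, which is not part of the algorithm shown here.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means sketch: steps 2-3 repeat until membership stabilizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: select initial centers
    clusters = []
    for _ in range(iters):
        # step 2: new partition, assigning each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # step 3: recompute each center as the centroid of its cluster
        new_centers = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:  # cluster membership has stabilized
            break
        centers = new_centers
    return centers, clusters
```

For two well-separated groups of points, any choice of initial centers from the data converges to the two natural clusters within a few iterations.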
DECISION TREE FOR RESPONSE CLASSIFICATION
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a Decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular algorithms and have been successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess credit risk of loan applicants (Bar-Or et al., 2005). Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance and each branch descending from that node corresponds to one of the possible values for this attribute (Mitchell, 1997).
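The root-to-leaf sorting just described can be sketched as a small routine. The tree below and its feature names (an FLF attribute and a topic attribute) are hypothetical illustrations, not the tree learned in this study.

```python
# A node is either a class label (leaf) or a pair (attribute, {value: subtree}).
def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf label,
    testing one attribute at each internal node."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[instance[attr]]  # follow the branch for this value
    return tree

# Hypothetical two-level tree over DAMSL-style features:
tree = ('flf', {'info-request': ('topic', {'price': 'price-info',
                                           'schedule': 'schedule-info'}),
                'commit': 'booking'})
```

For example, an utterance with `flf = 'commit'` is sorted directly to the leaf `'booking'`, while an info-request is further tested on its topic.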
Response classification is part of a two-staged classification-and-ranking architecture as shown in Fig. 2. This architecture proposed by Mustapha et al. (2008). The first component is a classifier that classifies user input utterances into response classes based on their contextual, pragmatic interpretations. The second component is a ranker that scores the candidate response utterances according to semantic content relevant to the input utterance.
One approach to classification would be to generate all possible Decision trees that correctly classify the training set and to select the simplest of them. The number of such trees is finite but very large, so this approach would only be feasible for small classification tasks. ID3 Decision tree was designed for the other end of the spectrum, where there are many attributes and the training set contains many objects, but where a reasonably good Decision tree is required without much computation (Quinlan, 1986).
The basic structure of ID3 is iterative (Li and Aiken, 1998). A subset of the training set is chosen at random and a Decision tree is formed from it; this tree correctly classifies all instances in the subset. All other instances in the training set are then classified using the tree. If the tree gives the correct answer for all of these instances, it is correct for the entire training set and the process terminates. If not, a selection of the incorrectly classified instances is added to the subset and the process continues. This procedure will always produce a Decision tree that correctly classifies each instance in the training set, provided that a test can always be found that gives a nontrivial partition of any set of instances. For ID3, the choice of test is the selection of an attribute for the root of the tree: ID3 adopts a mutual-information criterion, branching on the attribute that gains the most information. The inductive biases inherent in ID3 are thus preference biases that favor a simple hypothesis.
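The mutual-information criterion can be illustrated with a short sketch. The attribute and label names below are hypothetical stand-ins chosen to resemble the corpus annotation, not the study's actual feature set.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, attr):
    """Information gained about the class by branching on `attr`:
    the criterion ID3 uses to choose the attribute to test at a node."""
    n = len(instances)
    remainder = 0.0
    for value in {inst[attr] for inst in instances}:
        subset = [lab for inst, lab in zip(instances, labels)
                  if inst[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

On a toy set where one attribute perfectly predicts the class and another is uninformative, the first yields the full class entropy as gain and the second yields zero, so ID3 branches on the first.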
The main task of any classification problem is to identify the class that an observation belongs to, which in this study is to identify a response class for each user utterance, i.e., to estimate P(response class | user utterance). The purpose of response classification is to maximize the recognition accuracy of correct predictions of the response class rc, given the user utterance U.
Fig. 2: | The two-staged classification-and-ranking architecture |
The user utterances are characterized by semantic and pragmatic features, represented by nodes in the Decision tree; at each node, the utterance property that best discriminates among the user utterances U is selected. This process continues until the tree classifies the training utterances perfectly, or until all features have been used. We use rc to denote our estimate of the correct response class.
EXPERIMENTAL OPERATIONS
The experiments performed concern the classification of user input utterances into response classes based on features extracted from those utterances. The experiment was carried out in two stages; Figure 3 illustrates them. SCHISMA provides 920 instances of user utterances from 64 dialogues. In the first stage, the response class for each user utterance is re-tagged according to the topic of the response utterance, using the K-means clustering method on the semantic and pragmatic features extracted from each user utterance.
Response class tagging follows the pattern of one input and one response utterance per turn throughout the course of the conversation, to maintain coherency in each sequence of two utterances. There were 15 response classes, using the same naming conventions as the topics in user utterances. Table 2 shows the statistics for the response classes with and without clustering.
In the second stage, 10-fold cross-validation is performed to split the data into ten approximately equal partitions, each being used in turn for testing while the remainder of the data is used for training.
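The splitting scheme can be sketched as a generic 10-fold partition of the 920 instances; the actual fold assignment used in the experiment is not reported, so the shuffle seed here is arbitrary.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle n instance indices and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(920, 10)  # the 920 SCHISMA user utterances
for test_fold in folds:
    train = [j for f in folds if f is not test_fold for j in f]
    # ... train the classifier on `train`, evaluate on `test_fold`
```

Each instance appears in exactly one test fold, so every utterance is tested exactly once across the ten runs.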
Fig. 3: | The stages of the experiment |
Table 2: | Statistics for response classes |
Table 3: | Features used as nodes in Decision tree |
Table 4: | Response classification accuracy comparison |
Table 5: | Performance evaluation of the classifiers |
Table 3 shows the semantic and pragmatic features used in the clustering and classification experiments. The speech acts (FLF) and grounding acts (BLF) of user utterances are readily available from the DAMSL-annotated SCHISMA corpus.
RESULTS
Table 4 compares the results of our response classification experiment, using the K-means clustering method in the data pre-processing stage together with the Decision tree classifier, with the previous findings: Bayesian networks (Mustapha et al., 2008) and classification using a Decision tree alone (Ali et al., 2009). We achieved a maximum accuracy of 97.5%, better than the 81.95% obtained using the Decision tree alone and the 73.9% achieved using Bayesian networks.
Table 5 shows the performance evaluations of the classifiers. The Decision tree classifier performs better and achieves a higher precision score when the K-means clustering method is used, compared to using the Decision tree only or the Bayesian network approach.
DISCUSSION OF RESULTS
The significant increase in classification accuracy shows that our approach, combining K-means clustering and a Decision tree, achieved the aim of the study: improving dialogue act classification of a user utterance into a response class. The Decision tree correctly classified user utterances with higher recognition accuracy than the baseline approaches.
The improvement over the previous studies is due to the reduction of irrelevant and incorrectly tagged response classes. This was achieved by re-tagging the response classes with the K-means clustering method; as a result, the Decision tree classifier performs robustly on the classification task and correctly classifies user utterances into response classes. In contrast, the previous work using a Decision tree or Bayesian networks addressed only the classification task, so the irrelevant and incorrectly tagged response classes represented in the corpus led to low classification accuracy.
CONCLUSIONS
This study focused on the classification of response utterances into response classes. The experiment showed that the performance of the Decision tree classifier improved after applying the K-means clustering approach in the data pre-processing stage, achieving a best accuracy of 97.5%. The second component in the classification-and-ranking architecture is a ranker that scores the response utterances in a particular response class; in future work, we intend to investigate the performance of the Decision tree in ranking the response classes.