Research Article

K-Means Clustering to Improve the Accuracy of Decision Tree Response Classification

S.A. Ali, N. Sulaiman, A. Mustapha and N. Mustapha

Combining deep generation with statistical surface generation benefits from response utterances that are readily available in a corpus. The representation and quality of the instance data are the foremost factors affecting the classification accuracy of the statistical method; thus, in a classification task, any irrelevant or unreliable tagging of the response classes will result in low accuracy. This study focuses on improving the dialogue act classification of a user utterance into a response class by clustering the semantic and pragmatic features extracted from each user utterance. A Decision tree is used to classify a corpus of 64 mixed-initiative transaction dialogues in the theater domain. The experiment shows that by applying a clustering technique in the pre-processing stage to re-tag the response classes, the Decision tree achieves 97.5% recognition accuracy in classification, better than the 81.95% recognition accuracy obtained when using the Decision tree alone.


  How to cite this article:

S.A. Ali, N. Sulaiman, A. Mustapha and N. Mustapha, 2009. K-Means Clustering to Improve the Accuracy of Decision Tree Response Classification. Information Technology Journal, 8: 1256-1262.

DOI: 10.3923/itj.2009.1256.1262



A dialogue system is a computer system that communicates with human users via natural language in a coherent structure to achieve certain communicative goals. Dialogue systems deal with interaction-management issues such as turn-taking and topic management. One of the main concerns in such systems is the coherency of the response utterances, where focusing attention on certain words provides useful information for extracting the meaning of the utterance. This means that, to generate a suitable response in the dialogue, the system must be able to recognize this response while maintaining equivalent semantic content (Keizer and Bunt, 2007; Castro et al., 2004).

In order to understand the role that an utterance plays in the dialogue (e.g., a question for information or a request to perform an action), dialogue systems need to perform dialogue act classification and generate an appropriate turn. In recent years, several empirical techniques have been used to train dialogue act classifiers (Reithinger and Maier, 1995; Stolcke et al., 2000; Walker et al., 2001).

Ali et al. (2009) proposed the use of a Decision tree to improve dialogue act classification for classifying a user utterance into a response class in a classification-and-ranking architecture. The results show that the Decision tree classifier is well suited to the classification task, achieving the best accuracy compared with the previous techniques, namely Bayesian networks and Maximum likelihood.

Yang et al. (2008) show that dialogue act tagging and prosodic information can help to improve the identification of action item descriptions and agreements, by training a Decision tree classifier on prosodic features. In addition to these features, the confidence scores of action motivators and prosody can be extracted automatically, without the cost of human tagging.

Decision trees are able to learn from conversations and to optimize dialogue management, as well as minimize the number of turn-taking steps in the dialogue. The Decision tree represents the chronological ordering of actions via the parent-child relationship and uses an object frame to represent the information state. The findings show that a dialogue-management mechanism supporting decision processes based on a Decision tree can be successfully applied in dialogue applications such as contact-center solutions (Fodor, 2007).

Olguin and Cortes (2006) presented a methodology that promises a simple way to identify dialogue act types for the construction of dialogue managers. The proposed methodology applies CART-style decision trees to corpus data, where the predictors are utterance duration and sentence mood and the target is the dialogue act type. First, sentence mood is predicted from INTSINT intonation tagging. The utility of predicting sentence mood was shown by comparing trees in which tagged sentence mood, predicted sentence mood and no sentence mood at all were assessed. The resulting decision trees can be represented as if-then rule sets, which can be programmed into a dialogue management system to identify the dialogue act type of an unknown utterance.

Komatani et al. (2005) proposed an abstract structure of a database search task and modeled it in two modes: specifying query conditions and requesting detailed information. A set of very simple dialogue acts is then defined corresponding to this dialogue model. Furthermore, they created a model that maintains query conditions as a tree structure, which can be used to weight the attributes of query conditions. The constraints derived from these models are integrated using Decision tree learning, so that the system can determine the dialogue act of an utterance and decide whether each content word should be accepted or rejected, even when it contains Automatic Speech Recognition (ASR) errors.

In classification-and-ranking, the decision to choose one response utterance over another requires a considerable amount of domain knowledge. Hence, a knowledge-based approach as in deep generation is absolutely necessary. Deep generation determines the content of an utterance, or what to say, while surface generation realizes the structure of the utterance, or how to say it. Because deep generation requires a high degree of linguistic abstraction to produce the fine-grained input specifications needed to drive surface generators (Varges and Purver, 2006; Langkilde-Geary, 2002; Belz, 2007), its primary drawback is the classic knowledge-engineering bottleneck.

Overgeneration-and-ranking approaches to natural language generation have become increasingly popular (Paiva and Evans, 2005; Oh and Rudnicky, 2000). Overgeneration-and-ranking in dialogue processing performs mild overgeneration of candidates, followed by ranking to select the highest-ranked candidate as output (Varges and Purver, 2006). The main problem with this approach is that it has to generate many candidate sentences. In addition, language models such as n-grams have a built-in bias towards shorter strings, because the likelihood of a string of words is calculated as its joint probability, i.e., the product of the probabilities of each word given its n-1 preceding words (Belz, 2007). Such a bias is undesirable for the generation of dialogue utterances, because all candidates should be treated equally, regardless of their length and language rules.
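The length bias can be seen with a toy model in which every token receives the same conditional probability (an illustrative assumption, not a claim about any particular language model): since the joint probability is a product over tokens, a longer candidate always scores lower than a shorter one, even when both are equally coherent.

```python
import math

def ngram_logprob(tokens, logp_per_token=math.log(0.1)):
    """Joint log-probability under a toy model where every token has the
    same conditional probability given its n-1 preceding words."""
    return logp_per_token * len(tokens)

short = ngram_logprob(["book", "two", "tickets"])
longer = ngram_logprob(["i", "would", "like", "to", "book", "two", "tickets"])
assert longer < short  # the longer, equally coherent candidate scores lower
```

A ranker built directly on such scores would systematically prefer terse responses, which is why candidates in dialogue generation should be compared independently of length.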

In this study, we propose a response classification experiment based on user intentions using a Decision tree. Intention-based response generation systems require the task of classifying response utterances into response classes; a response class contains all response utterances that are coherent with a particular input utterance. We used the K-means clustering method to re-tag the response classes, in order to reduce irrelevant or unreliable tagging of the response classes. Classification-based NLG has previously been applied to tasks in deep generation to guide the process of surface generation (Marciniak and Strube, 2004). However, as noted by Stent (2002), earlier classification-based experiments do not take a fully stochastic approach to response generation, but apply it only in deep generation.


The SCHISMA (SCHouwburg Informatie Systeem) dialogue corpus is a collection of 64 text-based dialogues from a theater information and ticket reservation system. Users can make inquiries about scheduled theater performances and, if they wish, make ticket reservations. Figure 1 shows an extract of the dialogue data in SCHISMA. The corpus was obtained through a series of Wizard of Oz experiments, built purposely for the acquisition of a dialogue corpus for the theater domain. It contains 920 user utterances and 1127 server utterances in total. The SCHISMA corpus is mixed-initiative (Hulstijn and van Hessen, 1998).

There are two types of interaction: inquiry and transaction. During inquiry, the user has the initiative and the system answers the user's questions. When the user has indicated that he or she wants to make a reservation, the system takes the initiative in the transaction. The system asks the user a series of questions, such as the number of tickets to reserve and discount cards; the user answers the questions to complete the reservation details required by the system.

Fig. 1: SCHISMA dialogue excerpt

Table 1: FLF and BLF for SCHISMA

In a transaction dialogue, before the booking stage is reached, the user and the system must cooperate to reach agreement on several issues, such as the price of the ticket, the seating arrangement or the availability of a discount. This model is more complex than question-answering systems because, at any time, either party may request information from the other; in particular, the user might back out of previous decisions and steer the conversation in a completely different direction (Traum, 1997).

The SCHISMA corpus is tagged using a dialogue act annotation scheme based on the Dialogue Act Markup in Several Layers (DAMSL) framework by Keizer and den Akker (2007). Table 1 shows the dialogue acts, represented as forward-looking functions (FLFs) and backward-looking functions (BLFs), in the SCHISMA corpus. SCHISMA-DAMSL consists of five layers, each of which covers a different aspect of communicative function. This study is concerned with two layers only, the forward-looking and backward-looking functions; both indicate the communicative functions of an utterance. The FLF tags indicate the type of speech act that the utterance conveys, for example, assert, info-request and commit. The BLF tags indicate how a particular utterance relates to the previous utterance and include answers (positive, negative or no-feedback) to questions, degree of understanding or disagreement.


Clustering is an important tool for a variety of applications in data mining, statistical data analysis, data compression and vector quantization. The goal of clustering is to group data into clusters such that the similarities among data members within the same cluster are maximal while similarities among data members from different clusters are minimal (Deelers and Auwatanamongkol, 2007). K-means is a well known prototype-based, partitioning clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids. The K-means algorithm is as follows:

1. Select initial centers of the K clusters
2. Generate a new partition by assigning each data point to its closest cluster center
3. Compute new cluster centers as the centroids of the clusters
4. Repeat steps 2 and 3 until the cluster membership stabilizes
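The steps above can be sketched in a few lines. This is a minimal, illustrative implementation (not the authors' code), assuming the data points are numeric feature vectors:

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means: returns (centroids, cluster label per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centers
    labels = [0] * len(points)
    for _ in range(max_iters):
        # step 2: assign each point to its closest cluster center
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # step 3: recompute each center as the centroid of its cluster
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j])
        if new_centroids == centroids:         # step 4: membership stabilized
            break
        centroids = new_centroids
    return centroids, labels
```

For the re-tagging stage in this study, each point would be the semantic/pragmatic feature vector of a user utterance and K would be the number of response classes.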


Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a Decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular algorithms and have been successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess credit risk of loan applicants (Bar-Or et al., 2005). Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance and each branch descending from that node corresponds to one of the possible values for this attribute (Mitchell, 1997).
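As an illustration of this sorting process, a learned tree can be represented as nested attribute tests; the node and class names below are hypothetical and not taken from the paper's feature set:

```python
# A tree as nested dicts: internal nodes test one attribute, branches are
# attribute values, leaves are class labels (all names are illustrative).
tree = {
    "attr": "speech_act",
    "branches": {
        "info-request": "INQUIRY",
        "commit": {
            "attr": "topic",
            "branches": {"ticket": "RESERVATION", "show": "INQUIRY"},
        },
    },
}

def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf label."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attr"]]]
    return node
```

For example, classify(tree, {"speech_act": "commit", "topic": "ticket"}) follows the commit branch, then the ticket branch, and returns "RESERVATION". The same nested tests read off directly as if-then rules.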

Response classification is part of a two-staged classification-and-ranking architecture, as shown in Fig. 2. This architecture was proposed by Mustapha et al. (2008). The first component is a classifier that classifies user input utterances into response classes based on their contextual, pragmatic interpretations. The second component is a ranker that scores the candidate response utterances according to the semantic content relevant to the input utterance.

One approach to classification would be to generate all possible Decision trees that correctly classify the training set and to select the simplest of them. The number of such trees is finite but very large, so this approach would only be feasible for small classification tasks. ID3 Decision tree was designed for the other end of the spectrum, where there are many attributes and the training set contains many objects, but where a reasonably good Decision tree is required without much computation (Quinlan, 1986).

The basic structure of ID3 is iterative (Li and Aiken, 1998). A subset of the training set is chosen at random and a Decision tree is formed from it; this tree correctly classifies all instances in the subset. All other instances in the training set are then classified using the tree. If the tree gives the correct answer for all of these instances, it is correct for the entire training set and the process terminates. If not, a selection of the incorrectly classified instances is added to the subset and the process continues. This procedure will always produce a Decision tree that correctly classifies each instance in the training set, provided that a test can always be found that gives a nontrivial partition of any set of instances. For ID3, the choice of test is the selection of an attribute for the root of the tree. ID3 adopts a mutual-information criterion, branching on the attribute that gains the most information. Thus, the inductive biases inherent in ID3 are preference biases that favor simpler hypotheses.
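The mutual-information criterion can be made concrete as follows. This is the standard textbook formulation of ID3's information gain, sketched here for illustration rather than taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from branching on attr (ID3's criterion)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

At each node, ID3 evaluates the information gain of every remaining attribute and branches on the maximizer, which is what produces its preference for shallow, simple trees.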

The goal of any classification task is to identify the class to which an observation belongs, which in this study means identifying a response class for each response utterance, i.e., estimating P(response class|user utterance). The purpose of response classification is to maximize the accuracy of correct predictions of the response class rc, given the user utterance U.

Fig. 2: The two-staged classification-and-ranking architecture

The user utterances are characterized by semantic and pragmatic features, represented as nodes in the Decision tree; at each node, the utterance property that best discriminates the user utterance U is selected. This process continues until the tree classifies perfectly, or until all features have been used. We use rc to denote our estimate of the correct response class.


The experiments concern the classification of user input utterances into response classes based on features extracted from the user input utterances. The experiment was carried out in two stages; Figure 3 illustrates these stages. SCHISMA provides 920 instances of user utterances from 64 dialogues. In the first stage, the response class for each user utterance is re-tagged according to the topic of the response utterance, using the K-means clustering method on the semantic and pragmatic features extracted from each user utterance.

Response class tagging adapts to the patterns of input and response utterances per turn throughout the course of a conversation, to maintain coherency across each sequence of two utterances. There were 15 response classes, using the same naming conventions as the topics in the user utterances. Table 2 shows the statistics for the response classes with and without clustering.

In the second stage, 10-fold cross-validation is performed to split the data into ten approximately equal partitions, each being used in turn for testing while the remainder of the data is used for training.
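A sketch of this partitioning step (illustrative, not the authors' implementation):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle n instance indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(920, k=10)  # the 920 SCHISMA user utterances
# each fold serves once as the test set; the other nine form the training set
```

With 920 instances, each of the ten folds holds exactly 92 utterances.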

Fig. 3: The stages of the experiment

Table 2: Statistics for response classes

Table 3: Features used as nodes in Decision tree

Table 4: Response classification accuracy comparison

Table 5: Performance evaluation of the classifiers

Table 3 shows the semantic and pragmatic features used in the clustering and classification experiments. The speech acts (FLF) and grounding acts (BLF) of the user utterances are readily available from the DAMSL-annotated SCHISMA corpus.


Table 4 compares the results of the response classification experiment using our approach, the K-means clustering method in the data pre-processing stage together with a Decision tree classifier, against previous findings, namely Bayesian networks (Mustapha et al., 2008) and classification using the Decision tree alone (Ali et al., 2009). We achieved a maximum accuracy of 97.5%, better than the 81.95% obtained using the Decision tree alone and the 73.9% achieved using Bayesian networks.

Table 5 shows the performance evaluations of the classifiers. The Decision tree classifier performs better and achieves a higher precision score after the use of the K-means clustering method, compared with using the Decision tree alone or the Bayesian network approach.
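For reference, the accuracy and per-class precision scores of the kind reported in Table 4 and 5 can be computed as follows; this is a generic sketch, since the paper does not specify its exact evaluation tooling:

```python
def accuracy(pred, gold):
    """Fraction of utterances assigned their correct response class."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def precision(pred, gold, cls):
    """Of the utterances predicted as class cls, the fraction truly in cls."""
    hits = [g for p, g in zip(pred, gold) if p == cls]
    return hits.count(cls) / len(hits) if hits else 0.0
```

Accuracy summarizes the classifier overall, while precision isolates how reliably each individual response class is predicted.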


The significant increase in classification accuracy shows that our approach, using K-means clustering with a Decision tree, achieved the aim of this study: to improve dialogue act classification for classifying a user utterance into a response class. The Decision tree correctly classified the user utterances with higher recognition accuracy than the baseline approaches.

The result shows a significant increase over the previous studies due to the reduction of irrelevant and incorrectly tagged response classes. This was achieved by re-tagging the response classes using the K-means clustering method, which made the Decision tree classifier more robust on the classification task and able to classify user utterances into the correct response classes. In contrast, previous work using a Decision tree or Bayesian networks alone addressed only the classification task, so the irrelevant and incorrectly tagged response classes represented in the corpus led to misclassifications and low accuracy.


This study focused on the classification of response utterances into response classes. The experiment showed that the performance of the Decision tree classifier improved after applying the K-means clustering approach in the data pre-processing stage, with the classifier achieving a best accuracy of 97.5%. The second component in the classification-and-ranking architecture is a ranker that scores the response utterances within a particular response class; in future work, we need to investigate the performance of the Decision tree for ranking the response classes.


  1. Ali, S.A., N. Sulaiman, A. Mustapha and N. Mustapha, 2009. Improving accuracy of intention-based response classification using decision tree. Inform. Technol. J., 8: 923-928.

  2. Bar-Or, A., D. Keren, A. Schuster and R. Wolff, 2005. Hierarchical decision tree induction in distributed genomic databases. IEEE Trans. Knowl. Data Eng., 17: 1138-1151.

  3. Belz, A., 2007. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Nat. Language Technol., 1: 1-26.

  4. Castro, M., D. Vilar, P. Aibar and E. Sanchis, 2004. Dialogue Act Classification in a Spoken Dialogue System. Vol. 3040, Springer-Verlag, Berlin, Heidelberg, ISBN: 978-3-540-22218-7, pp: 260-270

  5. Paiva, D.S. and R. Evans, 2005. Empirically-based control of natural language generation. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, June 25-30, 2005, Ann Arbor, Michigan, pp: 58-65

  6. Deelers, S. and S. Auwatanamongkol, 2007. Enhancing K-means algorithm with initial cluster centers derived from data partitioning along the data axis with the highest variance. Proc. World Acad. Sci. Eng. Technol., 26: 323-328.

  7. Fodor, P., 2007. Dialog management for decision processes. Proceedings of the 3rd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 2007, Poznan, Poland, pp: 1-4

  8. Hulstijn, J. and A. Van-Hessen, 1998. Utterance generation for transaction dialogues. Proceedings of the 5th International Conference of Spoken Language Processing, November 30-December 4, 1998, Sydney, Australia, pp: 1-4

  9. Keizer, S. and H. Bunt, 2007. Evaluating combinations of dialogue acts for generation. Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, September 1-2, 2007, Antwerp, Belgium, pp: 158-165

  10. Komatani, K., N. Kanda, T. Ogata and H.G. Okuno, 2005. Contextual constraints based on dialogue models in database search task for spoken dialogue systems. Proceedings of the 9th European Conference on Speech Communication and Technology, Sept. 4-8, Lisbon, Portugal, pp: 877-880

  11. Langkilde-Geary, I., 2002. An empirical verification of coverage and correctness for a general-purpose sentence generator. Proceedings of the 2nd International Natural Language Generation Conference, (NLG'02), IEEE Xplore, pp: 17-24

  12. Li, W. and M. Aiken, 1998. Inductive learning from preclassified training examples: An empirical study. IEEE Trans. Syst. Man Cybernet. Part C, 28: 288-294.

  13. Marciniak, T. and M. Strube, 2004. Classification-based generation using TAG. Proceedings of the International Conference of National Language Generation, (INLG'04), Brockenhurst, UK., pp: 100-109

  14. Mitchell, T., 1997. Machine Learning, Computer Science Series. McGraw Hill, New York, ISBN: 0070428077

  15. Mustapha, A., M.N. Sulaiman, R. Mahmod and H. Selamat, 2008. Classification-and-ranking architecture for response generation based on intentions. Int. J. Comput. Sci. Network Secur., 8: 253-258.

  16. Oh, A.H. and A.I. Rudnicky, 2000. Stochastic natural language generation for spoken dialogue systems. Comput. Speech Language, 16: 387-407.

  17. Olguin, S.R.C. and L.A.P. Cortes, 2006. Predicting Dialogue Acts from Prosodic Information. In: Computational Linguistics and Intelligent Text Processing, Gelbukh, A. (Ed.). LNCS., 3878, Springer-Verlag, Berlin, Heidelberg, ISBN: 978-3-540-32205-4, pp: 355-365

  18. Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn., 1: 81-106.

  19. Reithinger, N. and E. Maier, 1995. Utilizing statistical dialogue act processing in Verbmobil. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, June 26-30, 1995, Morristown, New Jersey, USA., pp: 116-121

  20. Stent, A.J., 2002. A conversation acts model for generating spoken dialogue contributions. Comput. Speech Language, 16: 313-352.

  21. Stolcke, A., K. Ries, N. Coccaro, E. Shriberg and R. Bates et al., 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput. Linguist., 26: 339-373.

  22. Traum, D., 1997. Views on mixed-initiative interaction. Proceedings of the AAAI97 Spring Symposium on Mixed-Initiative Interaction, March 24-26, 1997, Stanford, California, pp: 169-171

  23. Varges, S. and M. Purver, 2006. Robust language analysis and generation for spoken dialogue systems. Proceedings of the ECAI Workshop on Development and Evaluation of Robust Spoken Dialogue Systems for Real Applications, August 2006, Italy, pp: 1-4

  24. Walker, M.A., R. Passonneau and J.E. Boland, 2001. Qualitative and quantitative evaluation of DARPA communicator dialogue systems. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, July 6-11, 2001, Morristown, New Jersey, USA., pp: 515-522

  25. Yang, F., G. Tur and E. Shriberg, 2008. Exploiting dialogue act tagging and prosodic information for action item identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Mar. 31-Apr. 4, Las Vegas, NV., pp: 4941-4944
