K-Means Clustering to Improve the Accuracy of Decision Tree Response Classification
The use of deep generation with statistical surface generation benefits from response utterances readily available in a corpus. Representation and quality of the instance data are the foremost factors affecting the classification accuracy of the statistical method. Thus, in a classification task, any irrelevant or unreliable tagging of the represented response classes will result in low accuracy. This study focuses on improving dialogue act classification of a user utterance into a response class by clustering the semantic and pragmatic features extracted from each user utterance. A Decision tree approach is used to classify a corpus of 64 mixed-initiative, transaction dialogues in the theater domain. The experiment shows that by using a clustering technique in the pre-processing stage to re-tag response classes, the Decision tree achieves 97.5% recognition accuracy in classification, better than the 81.95% recognition accuracy of the Decision tree alone.
A dialogue system is a computer system that communicates with human users via
natural language in a coherent structure to achieve certain communicative goals.
Dialogue systems deal with interaction management issues such as turn-taking
and topic management. One of the main concerns in such systems is the coherency of the response utterances, where focusing attention on the words that provide useful information helps in extracting the meaning of the utterance. This means that, to generate a coherent response in the dialogue, the system must be able to recognize this meaning while maintaining equivalent semantic content (Keizer and Bunt, 2007; Castro et al., 2004).
In order to understand the role that an utterance plays in the dialogue act
(e.g., a question for information or a request to perform an action), dialogue
systems need to perform dialogue act classification and generate an appropriate
turn. In recent years, several empirical techniques have been used to train
the dialogue act classifier (Reithinger and Maier, 1995;
Stolcke et al., 2000; Walker
et al., 2001).
Ali et al. (2009) proposed the use of Decision
tree to improve the dialogue act classification for classifying a user utterance
into a response class in classification-and-ranking architecture. The result
shows that the Decision tree classifier is well suited for classification task,
where the classifier performances achieved the best accuracy as compared to
previous techniques, which are the Bayesian networks and Maximum likelihood.
Yang et al. (2008) show that the use of dialogue
act tagging and prosodic information can help to improve the identification
of action item descriptions and agreements by training a Decision tree classifier
using prosodic features. In addition, the confidence scores of action motivators and prosody can be extracted automatically without human tagging.
Decision trees are able to learn from conversations and optimize the dialogue
management as well as minimize the number of turn-taking steps in the dialogue.
The Decision tree represents the chronological ordering of the actions via the
parent-child relationship and uses an object frame to represent the information
state. The findings show that a dialogue management mechanism that supports decision processes based on a Decision tree can be successfully applied in dialogue applications such as contact center solutions (Fodor, 2007).
Olguin and Cortes (2006) presented a methodology that promises a simple way to identify dialogue act types for the construction of dialogue managers. The proposed methodology builds CART-style decision trees on corpus data, where the predictor variables are utterance duration and sentence mood and the target variable is the dialogue act type. First, sentence mood is predicted
from INTSINT intonation tagging. The utility of predicting sentence mood was
shown by comparing trees where tagged sentence mood, predicted sentence mood
and no sentence mood at all were assessed. The resulting decision trees can be represented in the form of if-then rule sets, which can be programmed into a dialogue management system to identify the dialogue act type of an unknown utterance.
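As a sketch of how such an exported rule set might look, consider the following; the feature names, the one-second threshold, the sentence moods and the dialogue act labels are hypothetical illustrations, not values from the cited study.

```python
# Hypothetical if-then rule set exported from a CART-style decision tree.
# Predictors: utterance duration (seconds) and sentence mood, as in the
# methodology described above; thresholds and act names are illustrative.
def dialogue_act(duration_s: float, sentence_mood: str) -> str:
    if sentence_mood == "interrogative":
        return "info-request"
    elif sentence_mood == "imperative":
        return "action-request"
    elif duration_s < 1.0:       # very short declaratives read as feedback
        return "backchannel"
    else:
        return "assert"

print(dialogue_act(0.4, "declarative"))  # backchannel
print(dialogue_act(2.0, "interrogative"))  # info-request
```

Each path from the tree's root to a leaf becomes one branch of the conditional, which is what makes the rule form easy to embed in a dialogue manager.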
Komatani et al. (2005) proposed an abstract structure of a database search task and modeled it in two modes: specifying query conditions
and requesting detailed information. Then, a set of very simple dialogue acts
is defined corresponding to this dialogue model. Furthermore, they created a model that maintains query conditions as a tree structure, which can be used to weight the attributes of query conditions. The constraints derived from these models are integrated using Decision tree learning, so that
the system can determine a dialogue act of the utterance and whether each content
word should be accepted or rejected, even when it contains Automatic Speech
Recognition (ASR) errors.
In classification-and-ranking, the decision to choose one response utterance over another requires a considerable amount of domain knowledge. Hence, a knowledge-based approach, as in deep generation, is absolutely necessary. Deep generation determines the content of an utterance, or what to say, while surface generation realizes the structure of the utterance, or how to say it. Because deep generation
requires a high degree of linguistic abstraction to produce fine-grained input
specifications in order to drive the surface generators (Varges
and Purver, 2006; Langkilde-Geary, 2002; Belz,
2007), its primary drawback is the classic problem of knowledge engineering.
Overgeneration and ranking approaches to natural language generation have become
increasingly popular (Paiva and Evans, 2005; Oh
and Rudnicky, 2000). Overgeneration-and-ranking in dialogue processing performs
mild overgeneration of candidates, followed by ranking to select the highest-ranked candidate as output (Varges and Purver, 2006). The main problem with this approach is that it has to generate many candidates to form sentences. In addition, n-gram language models have a built-in bias towards shorter strings, because the likelihood of a string of words is computed as the joint probability of the words, that is, the product of the probability of each word given its n-1 preceding words (Belz, 2007). This bias is undesirable for the generation of dialogue utterances, because all candidates must be treated equally regardless of their length and the language rules.
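The length bias can be made concrete with a small sketch: since every conditional word probability is below one, the joint probability shrinks with each additional factor, so a longer candidate of equal per-word quality always scores lower. The probability values below are illustrative, not from a trained model.

```python
import math

# Illustrative per-word conditional probabilities P(w_i | previous n-1 words);
# the values are hypothetical, not from a trained model.
short_candidate = [0.4, 0.5, 0.6]            # 3-word candidate
long_candidate = [0.4, 0.5, 0.6, 0.6, 0.6]   # 5-word candidate, same per-word quality

def string_likelihood(word_probs):
    """Joint probability of a word string under an n-gram model:
    the product of each word's conditional probability."""
    return math.prod(word_probs)

# Every extra factor is < 1, so the longer string always scores lower:
assert string_likelihood(long_candidate) < string_likelihood(short_candidate)
```

This is why likelihood-based ranking alone would systematically prefer shorter response candidates, independent of their suitability.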
In this study, we propose a response classification experiment based on user intentions using a Decision tree. Intention-based response generation systems
require the task of classifying the response utterances into response classes.
A response class contains all response utterances that are coherent to a particular
input utterance. We used the K-means clustering method to re-tag response classes, reducing irrelevant or unreliable tagging of the represented response classes.
Classification-based NLG has been carried out for tasks in deep generation to
guide the process of surface generation (Marciniak and Strube,
2004). However, as noted by Stent (2002), former classification-based experiments do not take a fully stochastic approach to response generation, but apply classification only in deep generation.
The SCHISMA (SCHouwburg Informatie Systeem) dialogue corpus is a collection of 64 text-based dialogues with a theater information and ticket reservation system.
Users can make inquiries about scheduled theater performances and, if they wish, make ticket reservations. Figure 1 shows an
extract of the dialogue data in SCHISMA. The corpus was obtained through a series of Wizard of Oz experiments, built purposely for the acquisition of a dialogue corpus for the theater domain. The corpus contains 920 user utterances and 1127 server utterances in total. The SCHISMA corpus is mixed-initiative (Hulstijn and van Hessen, 1998).
There are two types of interaction: inquiry and transaction. During an inquiry the user has the initiative; the system answers the user's questions. When the user has indicated that he or she wants a reservation transaction, the system takes the initiative. The system asks the user a series of questions, such as the number of tickets to reserve, discount cards and others. The user answers the questions to complete the reservation details required by the system.
|| SCHISMA dialogue excerpt
|| FLF and BLF for SCHISMA
In a transaction dialogue, before the stage of booking is reached, the user and the system must cooperate to reach agreement on several issues, such as the value of the ticket, the seating arrangement or the availability of a discount. This model is more complex than that of question-answering systems because, at any time, either party may request information from the other; the user in particular might back out of any previous decision and start talking in a completely different direction (Traum, 1997).
The SCHISMA corpus is tagged using a dialogue act annotation scheme based on the Dialogue Act Markup in Several Layers (DAMSL) framework by Keizer and den Akker (2007). Table 1 shows the dialogue acts, represented as FLFs and BLFs, in the SCHISMA corpus. SCHISMA-DAMSL consists of five layers, each of which covers a different aspect of communicative functions. This study is concerned with only two layers, the forward-looking functions (FLF) and backward-looking functions (BLF). Both layers indicate the communicative functions of an utterance.
The FLF tags indicate the type of speech act that the utterance is conveying,
for example, assert, info-request and commit. BLF tags indicate how the particular
utterance relates to the previous utterance and include answers (positive, negative
or no-feedback) to questions, degree of understanding or disagreement.
K-MEANS CLUSTERING FOR PRE-PROCESSING
Clustering is an important tool for a variety of applications in data mining,
statistical data analysis, data compression and vector quantization. The goal
of clustering is to group data into clusters such that the similarities among
data members within the same cluster are maximal while similarities among data
members from different clusters are minimal (Deelers and
Auwatanamongkol, 2007). K-means is a well known prototype-based, partitioning
clustering technique that attempts to find a user-specified number of clusters
(K), which are represented by their centroids. The K-means algorithm is as follows:
||Select initial centers of the K clusters
||Generate a new partition by assigning each data point to its closest cluster center
||Compute new cluster centers as the centroids of the clusters
||Repeat steps 2 and 3 until the cluster membership stabilizes
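A minimal sketch of these steps, assuming NumPy and Euclidean distance, might look as follows; the demo data at the bottom are synthetic, not from the SCHISMA features.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: select initial centers -- here, k distinct random data points.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each data point to its closest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # cluster membership has stabilized
        labels = new_labels
        # Step 3: recompute each center as the centroid of its cluster
        # (keep the old center if a cluster happens to be empty).
        centers = np.array([data[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# Synthetic demo: two well-separated groups of points.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(data, k=2)
```

In the experiment described later, the data points would be the feature vectors extracted from user utterances, and the resulting cluster labels serve as re-tagged response classes.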
DECISION TREE FOR RESPONSE CLASSIFICATION
Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a Decision tree.
Learned trees can also be re-represented as sets of if-then rules to improve
human readability. These learning methods are among the most popular algorithms
and have been successfully applied to a broad range of tasks from learning to
diagnose medical cases to learning to assess credit risk of loan applicants
(Bar-Or et al., 2005). Decision trees classify
instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies
a test of some attribute of the instance and each branch descending from that
node corresponds to one of the possible values for this attribute (Mitchell, 1997).
Response classification is part of a two-staged classification-and-ranking
architecture as shown in Fig. 2. This architecture was proposed by Mustapha et al. (2008). The first component
is a classifier that classifies user input utterances into response classes
based on their contextual, pragmatic interpretations. The second component is
a ranker that scores the candidate response utterances according to semantic
content relevant to the input utterance.
One approach to classification would be to generate all possible Decision trees
that correctly classify the training set and to select the simplest of them.
The number of such trees is finite but very large, so this approach would only
be feasible for small classification tasks. ID3 Decision tree was designed for
the other end of the spectrum, where there are many attributes and the training
set contains many objects, but where a reasonably good Decision tree is required
without much computation (Quinlan, 1986).
The basic structure of ID3 is iterative (Li and Aiken,
1998). A subset of the training set is chosen at random and a Decision tree
formed from it; this tree correctly classifies all instances in the subset.
All other instances in the training set are then classified using the tree.
If the tree gives the correct answer for all of these instances, it is correct
for the entire training set and the process terminates. If not, a selection
of the incorrectly classified instances is added to the subset and the process
continues. This procedure will always produce a Decision tree that correctly
classifies each instance in the training set, provided that a test can always
be found that gives a nontrivial partition of any set of instances. For ID3,
the choice of test is the selection of an attribute for the root of the tree.
ID3 adopts a mutual-information criterion to choose the attribute to branch on, namely the attribute that gains the most information. Thus, the inductive biases inherent in ID3 are preference biases that explicitly favor a simple hypothesis.
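The mutual-information criterion can be illustrated with a toy computation of information gain, the expected reduction in entropy from branching on an attribute; the attribute values and class labels below are hypothetical, not drawn from the corpus.

```python
import math
from collections import Counter

def entropy(class_labels):
    """Shannon entropy of a list of class labels."""
    n = len(class_labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(class_labels).values())

def information_gain(rows, class_labels, attr_index):
    """Expected reduction in entropy from branching on one attribute."""
    n = len(rows)
    partitions = {}
    for row, label in zip(rows, class_labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part)
                    for part in partitions.values())
    return entropy(class_labels) - remainder

# Toy instances with two attributes; attribute 0 predicts the class
# perfectly, attribute 1 carries no information.
rows = [("q", "x"), ("q", "y"), ("a", "x"), ("a", "y")]
classes = ["ask", "ask", "answer", "answer"]
print(information_gain(rows, classes, 0))  # 1.0
print(information_gain(rows, classes, 1))  # 0.0
```

ID3 would place attribute 0 at the root here, since it yields the larger gain, which is exactly the greedy choice the criterion formalizes.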
The main goal of any classification task is to identify the class that an observation belongs to, which in this study is to identify a response class for each user utterance, i.e., to estimate P(response class|user utterance). The purpose of response classification is to find the response class rc with the highest probability of being correct, given the user utterance.
||The two-staged classification-and-ranking architecture
The user utterances are characterized by semantic and pragmatic features represented by nodes in the Decision tree; at each node, the tree selects the utterance property that best classifies the user utterance U. This process continues until the tree classifies perfectly, or until all features have been used. We use rc to denote our estimate of the correct response class.
The experiments concern the classification of user input utterances into response classes based on features extracted from the utterances. The experiment was carried out in two stages. Figure 3 illustrates the stages of the experiment. SCHISMA provides 920 instances of user utterances from 64 dialogues. In the first stage, the response class for each user utterance is re-tagged according to the topic of the response utterance using the K-means clustering method, based on the semantic and pragmatic features extracted from each user utterance.
Response class tagging follows the pattern of input and response utterances per turn throughout the course of the conversation, to maintain coherency in each sequence of two utterances. There were 15 response classes, using the same naming conventions as the topics in user utterances. Table 2 shows the statistics for the response classes with and without clustering.
In the second stage, 10-fold cross-validation is performed to split the data into ten approximately equal partitions, each being used in turn for testing while the remainder of the data is used for training.
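Assuming scikit-learn's KMeans, DecisionTreeClassifier and cross_val_score, the two stages could be sketched as follows; the feature matrix here is a random placeholder standing in for the encoded semantic and pragmatic features of the 920 SCHISMA utterances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Placeholder feature matrix: 920 utterances x 6 encoded semantic/pragmatic
# features (random values standing in for the real SCHISMA features).
X = rng.normal(size=(920, 6))

# Stage 1: re-tag response classes by clustering the feature vectors
# into the 15 classes used in the study.
retagged = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(X)

# Stage 2: 10-fold cross-validation of a Decision tree on the re-tagged labels.
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, retagged, cv=10)
print(round(scores.mean(), 3))
```

With real corpus features the accuracy figure would correspond to the recognition accuracy reported in the results tables; with the random placeholder above it merely demonstrates the pipeline.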
|| The stages of the experiment
|| Statistics for response classes
|| Features used as nodes in Decision tree
|| Response classification accuracy comparison
|| Performance evaluation of the classifiers
Table 3 shows the semantic and pragmatic features used in the clustering and classification experiments. The speech acts (FLF) and grounding acts (BLF) of user utterances are readily available from the DAMSL-annotated SCHISMA corpus.
Table 4 compares the results of the response classification experiment using our approach, the K-means clustering method in the data pre-processing stage followed by a Decision tree classifier, with the previous findings, namely the Bayesian networks (Mustapha et al., 2008) and the classification using a Decision tree alone (Ali et al., 2009). We achieved a maximum accuracy of 97.5%, better than the 81.95% obtained using the Decision tree alone and the 73.9% achieved using Bayesian networks.
Table 5 shows the performance evaluations of the classifiers. The Decision tree classifier performs better and achieves a higher precision score after the use of the K-means clustering method, as compared to using the Decision tree alone or the Bayesian network approach.
DISCUSSION OF RESULTS
The significant increase in classification accuracy shows that our approach using K-means clustering and a Decision tree achieved the aim of the study: to improve dialogue act classification in classifying a user utterance into a response class. The Decision tree correctly classified the user utterances with higher recognition accuracy than the baseline approaches.
The results show a significant increase over the previous studies due to the reduction of irrelevant and incorrect response classes. This was achieved by re-tagging the response classes using the K-means clustering method; as a result, the Decision tree classifier performed robustly on the classification task and correctly classified the user utterances into response classes. In contrast, the previous work using a Decision tree or Bayesian networks was concerned only with the classification task, so the irrelevant and incorrectly tagged response classes represented in the corpus resulted in low accuracy in the classification task.
This study focused on the classification of response utterances into response classes. The experiment showed that the performance of the Decision tree classifier improved after using the K-means clustering approach in the data pre-processing stage, achieving a best accuracy of 97.5%. The second component in the classification-and-ranking architecture is a ranker that scores the response utterances in a particular response class; in future work, we need to investigate the performance of the Decision tree in ranking the response classes.
1: Ali, S.A., N. Sulaiman, A. Mustapha and N. Mustapha, 2009. Improving accuracy of intention-based response classification using decision tree. Inform. Technol. J., 8: 923-928.
2: Bar-Or, A., D. Keren, A. Schuster and R. Wolff, 2005. Hierarchical decision tree induction in distributed genomic databases. IEEE Trans. Knowl. Data Eng., 17: 1138-1151.
3: Belz, A., 2007. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Nat. Language Technol., 1: 1-26.
4: Castro, M., D. Vilar, P. Aibar and E. Sanchis, 2004. Dialogue Act Classification in a Spoken Dialogue System. Vol. 3040, Springer-Verlag, Berlin, Heidelberg, ISBN: 978-3-540-22218-7, pp: 260-270.
5: Paiva, D.S. and R. Evans, 2005. Empirically-based control of natural language generation. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, June 25-30, 2005, Ann Arbor, Michigan, pp: 58-65.
6: Deelers, S. and S. Auwatanamongkol, 2007. Enhancing K-means algorithm with initial cluster centers derived from data partitioning along the data axis with the highest variance. Proc. World Acad. Sci. Eng. Technol., 26: 323-328.
7: Fodor, P., 2007. Dialog management for decision processes. Proceedings of the 3rd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 2007, Poznan, Poland, pp: 1-4.
8: Hulstijn, J. and A. Van-Hessen, 1998. Utterance generation for transaction dialogues. Proceedings of the 5th International Conference of Spoken Language Processing, November 30-December 4, 1998, Sydney, Australia, pp: 1-4.
9: Keizer, S. and H. Bunt, 2007. Evaluating combinations of dialogue acts for generation. Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, September 1-2, 2007, Antwerp, Belgium, pp: 158-165.
10: Komatani, K., N. Kanda, T. Ogata and H.G. Okuno, 2005. Contextual constraints based on dialogue models in database search task for spoken dialogue systems. Proceedings of the 9th European Conference on Speech Communication and Technology, Sept. 4-8, Lisbon, Portugal, pp: 877-880.
11: Langkilde-Geary, I., 2002. An empirical verification of coverage and correctness for a general-purpose sentence generator. Proceedings of the 2nd International Natural Language Generation Conference, (NLG'02), IEEE Xplore, pp: 17-24.
12: Li, W. and M. Aiken, 1998. Inductive learning from preclassified training examples: An empirical study. IEEE Trans. Syst. Man Cybernet. Part C, 28: 288-294.
13: Marciniak, T. and M. Strube, 2004. Classification-based generation using TAG. Proceedings of the International Conference of National Language Generation, (INLG'04), Brockenhurst, UK., pp: 100-109.
14: Mitchell, T., 1997. Machine Learning, Computer Science Series. McGraw Hill, New York, ISBN: 0070428077.
15: Mustapha, A., M.N. Sulaiman, R. Mahmod and H. Selamat, 2008. Classification-and-ranking architecture for response generation based on intentions. Int. J. Comput. Sci. Network Secur., 8: 253-258.
16: Oh, A.H. and A.I. Rudnicky, 2000. Stochastic natural language generation for spoken dialogue systems. Comput. Speech Language, 16: 387-407.
17: Olguin, S.R.C. and L.A.P. Cortes, 2006. Predicting Dialogue Acts from Prosodic Information. In: Computational Linguistics and Intelligent Text Processing, Gelbukh, A. (Ed.). LNCS., 3878, Springer-Verlag, Berlin, Heidelberg, ISBN: 8-3-540-32205-4, pp: 355-365.
18: Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn., 1: 81-106.
19: Reithinger, N. and E. Maier, 1995. Utilizing statistical dialogue act processing in Verbmobil. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, June 26-30, 1995, Morristown, New Jersey, USA., pp: 116-121.
20: Stent, A.J., 2002. A conversation acts model for generating spoken dialogue contributions. Comput. Speech Language, 16: 313-352.
21: Stolcke, A., K. Ries, N. Coccaro, E. Shriberg and R. Bates et al., 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput. Linguist., 26: 339-373.
22: Traum, D., 1997. Views on mixed-initiative interaction. Proceedings of the AAAI97 Spring Symposium on Mixed-Initiative Interaction, March 24-26, 1997, Stanford, California, pp: 169-171.
23: Varges, S. and M. Purver, 2006. Robust language analysis and generation for spoken dialogue systems. Proceedings of the ECAI Workshop on Development and Evaluation of Robust Spoken Dialogue Systems for Real Applications, August 2006, Italy, pp: 1-4.
24: Walker, M.A., R. Passonneau and J.E. Boland, 2001. Qualitative and quantitative evaluation of DARPA communicator dialogue systems. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, July 6-11, 2001, Morristown, New Jersey, USA., pp: 515-522.
25: Yang, F., G. Tur and E. Shriberg, 2008. Exploiting dialogue act tagging and prosodic information for action item identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Mar. 31-Apr. 4, Las Vegas, NV., pp: 4941-4944.