INTRODUCTION
In recent years, the rapid growth of creating, editing, processing, transferring and storing data in digital form has made automated machine learning classification systems increasingly important. These classification systems are useful in discovering, organizing, analyzing and mining data in order to transform data into information. Over the past decades, an increasing number of machine learning approaches have been developed to classify data into groups for specified purposes, including decision tree induction (Greiner and Schaffer, 2001; Quinlan, 1993), rule induction (Apte et al., 1994), self-organizing maps (Isa et al., 2008c, 2009), k-nearest neighbor classification (Han et al., 1999; Shin et al., 2006; Osuna, 2002; Tan, 2006), artificial neural networks (Abd Alla, 2006; Soltanizadeh and Shakirar, 2008), support vector machines (Burges, 1998; Haykin, 1999; Isa et al., 2008b, c; Lee, 2008; Joachims, 1998, 1999; Lin, 1999; Sani et al., 2009) and Bayesian classification (Domingos and Pazzani, 1997; Eyheramendy et al., 2003; Isa et al., 2008a; Kim et al., 2002a; Lee, 2008; Lee et al., 2010; McCallum and Nigam, 2003; Rish, 2001).
Each of the classification schemes mentioned above has its own properties and associated strengths and problems. The simple decision tree induction and rule induction approaches are easy to understand and interpret. As a trade-off, their low classification performance has restricted their wide implementation in real-world applications, especially when the number of distinguishing features within the data is large (Greiner and Schaffer, 2001; Quinlan, 1993). k-nearest neighbor (KNN) is another approach that is easy to implement and highly effective in classification tasks across varying problem domains. However, when the amount of training data is large, the k-nearest neighbor approach becomes computationally intensive and its classification accuracy can be drastically degraded as the number of attributes grows (Han et al., 1999; Shin et al., 2006; Osuna, 2002; Tan, 2006). Artificial neural networks outperform other classification approaches in handling data with high-dimensional features as well as noisy and contradictory data. However, their high computing cost, in terms of CPU and physical memory usage, is their main disadvantage. The Bayesian approach stands out for its simplicity and low computational cost in both the training and classifying stages and it has been widely implemented in various types of domains and applications. However, this generative method has been reported to be less accurate than discriminative methods such as support vector machines (Brucher et al., 2002; Chakrabarti et al., 2003; Godbole, 2006; Joachims, 1998, 1999; Lin, 1999; Yang and Pederson, 1997; Yang and Liu, 1999).
As one of the discriminative classification methods, Support Vector Machines (SVM) classification has been shown to be more accurate than many other classification approaches (Brucher et al., 2002; Chakrabarti et al., 2003; Godbole, 2006; Isa et al., 2008b; Joachims, 1998, 1999; Lee, 2008; Lin, 1999; Yang and Pederson, 1997; Yang and Liu, 1999). The outstanding generalization ability of SVM comes from its implementation of the Structural Risk Minimization (SRM) principle, which entails finding an optimal separating hyper-plane and thus guarantees the lowest classification error (Haykin, 1999). This characteristic has contributed to the success of SVM in classification applications across varying problem domains, such as text classification (Joachims, 1998, 1999; Lee, 2008), spam e-mail filtering (Drucker et al., 1999), gene expression data classification (Brown et al., 1999), protein fold and remote homology detection (Rangwala and Karypis, 2005), content based image retrieval (Tao et al., 2006), face recognition (Sani et al., 2009), alternative exons identification (Dror et al., 2005) and texture classification (Kim et al., 2002b; Li et al., 2003).
There exist some problems associated with the implementation of SVM as a classifier. One of the major problems of the SVM classification approach is that effort is needed to transform data into a suitable representation, typically numerical. Furthermore, the computation of inner products between support vectors and data points and the iterative binary classification processes required for multi-class classification lead to convoluted training and categorizing algorithms and hence high computing time and cost (Burges, 1998; Haykin, 1999; Joachims, 1998, 1999; Lin, 1999).
In the review presented in this study, we examine the situations in which SVM classification fails to perform effective and efficient classification and the reasons for these failures. It is the nature and properties of the conventional SVM algorithm that restrict the SVM classifier from performing well in certain problem domains, such as gene expression data classification, visual category recognition and spam e-mail filtering. In order to overcome these problems of conventional SVM, many research works have been carried out with the common goal of improving the classification performance of SVM by modifying its algorithm. In this study, we review the research works that modify conventional SVM by integrating the Nearest Neighbor (NN) algorithm, in order to develop hybrid classification models that fulfill the problem solving requirements of specific domains and hence contribute to better classification effectiveness and efficiency. Although the investigated NN-SVM hybrid classification models all integrate the NN algorithm with the SVM approach, each has its own way and philosophy of integration, tailored to its own classification domain and its unique problem solving requirements.
BACKGROUND OF SUPPORT VECTOR MACHINES AND k-NEAREST NEIGHBOR CLASSIFICATION APPROACHES
Support vector machines classification approach: Support Vector Machines (SVM) is one of the discriminative classification approaches and is commonly recognized to be more accurate than most other approaches. The SVM classification approach is based on the Structural Risk Minimization (SRM) principle from statistical learning theory (Vapnik, 1995). SRM is an inductive principle for model selection used for learning from finite training data; it provides a method for controlling the generalization ability of learning machines that use training data of small size (Zhang, 2004). The idea of this principle is to find a hypothesis that guarantees the lowest true error. In addition, the derivation of SVM is mathematically rigorous and very open to theoretical understanding and analysis (Joachims, 1998).
SVM needs both positive and negative training sets, which is uncommon for other classification methods. These positive and negative training sets are needed for SVM to seek the decision surface, also known as the hyper plane, that best separates the positive from the negative data in the n-dimensional space (Brucher et al., 2002). The document representatives which are closest to the decision surface are called support vectors. The performance of SVM classification remains unchanged if documents that do not belong to the support vectors are removed from the set of training data (Brucher et al., 2002).
The SVM classification approach stands out from the others with its better classification performance and its ability to handle documents with a high-dimensional input space while culling out most of the irrelevant features. The good generalization characteristic of SVM is due to the implementation of SRM, which entails finding an optimal hyper plane and thus guarantees the lowest classification error. Besides, a capacity which is independent of the dimensionality of the feature space makes SVM a highly accurate classifier in most applications (Joachims, 1998). However, the major drawback of SVM is its relatively complex training and categorizing algorithms and the resulting high time and memory consumption during both the training and classifying stages (Chakrabarti et al., 2003). Besides, confusion can occur during classification because a document may be annotated to several categories, since similarities are typically calculated individually for each category (Brucher et al., 2002).
SVM maps input vectors to a higher dimensional vector space where an optimal
separating hyper plane is constructed. The discussion here starts with the simplest
case with linear SVM on two categories of separable data. In a linear classifier,
two groups of separable data can be divided by a hyper plane. In Fig.
1, white dots represent data points from one category while black dots represent
data points from another category. As illustrated in Fig. 1,
there are many possible hyper planes that can separate different types of data.
However, there is only one hyper plane that maximizes the distance between itself
and the nearest data vectors of each category. This hyper plane which maximizes
the margin is called the optimal separating hyper plane and the margin is defined
as the sum of distances of the hyper plane to the closest training vectors of
each category.
Fig. 1: Optimal separating hyper plane in a linear SVM (white dots and black dots represent data points from two different categories, respectively)
The hyper planes that separate two groups of data can be expressed by Eq. 1:

w · x + b = 0      (1)

where x represents a training vector, w represents the vector perpendicular to the separating hyper plane and b represents the offset parameter which allows the increase of the margin. There are actually an infinite number of hyper planes that could separate data in a vector space into two groups, as represented by the dashed lines illustrated in Fig. 1. According to the SVM principle, there is just one optimal hyper plane, which lies half-way within the maximal margin. The solid line illustrated in Fig. 1 represents this optimal separating hyper plane and the margin in this case is d1+d2 (Gutschoven and Verlinde, 2000).
If the training data vectors are linearly separable, the optimal separating hyper plane can be selected from the infinite number of possible hyper planes by choosing two parallel hyper planes such that no vectors lie between them and the margin between them is maximized. The vectors that constrain the width of the margin are called the support vectors. According to the SVM principle, only support vectors are considered in determining the optimal separating hyper plane; therefore, there is a way to represent the support vectors for a given set of training vectors. The optimal separating hyper plane, the two parallel hyper planes passing through the support vectors and the support vectors themselves are illustrated in the vector space in Fig. 2. In Fig. 2, o symbols represent data points from one category while x symbols represent data points from another category.
Fig. 2: Illustration of the optimal separating hyper plane, the two parallel hyper planes and the support vectors (o symbols and x symbols represent data points from two different categories, respectively)
By using geometry, the distance between the support vectors of the two categories is found to be 2/||w||. To maximize the margin, (1/2)||w||² therefore needs to be minimized (Haykin, 1999). Thus, the optimal separating hyper plane can be configured by minimizing (1/2)||w||² under the constraint that the training data is correctly separated, as illustrated in Eq. 2:

yi (w · xi + b) ≥ 1, for all training vectors xi      (2)

where yi ∈ {+1, -1} is the category label of training vector xi.
The discussion above describes linear SVM on two categories of separable data, which is the most general and simplest form of the SVM classification approach. SVM classification may be more complex in cases such as multi-dimensional classification, non-linearly separable data vectors and non-linear classification. For n-dimensional classification under the SVM principle, as additional attributes are added, data vectors can be represented in n-dimensional space, where (n-1)-dimensional hyper planes are used to separate them (Haykin, 1999). In the case of non-linearly separable data vectors in an SVM classification task, in order to allow some flexibility in separating the categories, SVM models introduce positive slack variables ξi in the constraints, as illustrated in Eq. 3:

yi (w · xi + b) ≥ 1 - ξi, with ξi ≥ 0      (3)

and the minimization problem then becomes Eq. 4:

minimize (1/2)||w||² + C Σi ξi, subject to the constraints in Eq. 3      (4)

where C is a penalty parameter.
Fig. 3: Mapping a non-linear input space onto a high dimensional space (squares and dots represent data points from two different categories, respectively)
The positive slack variables ξi create a soft margin that permits some misclassifications, while the penalty parameter C in Eq. 4 controls the trade-off between allowing training errors and forcing rigid margins: increasing C raises the cost of misclassifying vectors and forces the creation of a more accurate model that may not generalize well.
According to the SVM principle, non-linear classification can be handled by introducing kernel functions to map the data vectors into a different space where a hyper plane can be used to do the separation. A kernel function is used when the decision function is not a linear function of the data; rather than fitting non-linear curves to the vector space to separate the data, the data is mapped from the input space into a high dimensional space through a non-linear transformation. As a result, a hyper plane can be used to do the separation in the high dimensional space, as illustrated in Fig. 3. In Fig. 3, squares represent data points from one category while dots represent data points from another category.
The algorithm of the non-linear classification is formally similar to the linear classification, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum margin hyper plane in the transformed high dimensional space. Therefore, although the classifier is a hyper plane in the high dimensional space, it may be non-linear in the original input space, as illustrated in Fig. 3.
Mapping the data vectors into a high dimensional space by using kernel functions contributes to superb classification performance, since it allows SVM models to separate data vectors even with very complex boundaries. There is an infinite number of possible kernel functions, but only certain kernel functions have been found to perform classification well across a wide variety of applications. Some commonly used, well performing kernel functions are:
• Polynomial (homogeneous): k(x, x') = (x · x')^d
• Polynomial (inhomogeneous): k(x, x') = (x · x' + 1)^d
• Gaussian radial basis function: k(x, x') = exp(-||x - x'||² / 2σ²)
• Hyperbolic tangent (sigmoid): k(x, x') = tanh(κ x · x' + c), for some (not every) κ > 0 and c < 0
Each of the kernel functions listed above has its own properties and responds uniquely to different kinds of data. An SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network (Burges, 1998). An SVM model with a Radial Basis Function kernel is closely related to radial basis function neural networks and its feature space has infinite dimension. Therefore, in order to perform optimal classification with an SVM model, a proper and optimal kernel function must be selected based on the requirements of the classification task. With an optimal kernel function implemented in the SVM model, the classification task is able to scale to high dimensional data relatively well and the trade-off between classifier complexity and classification error can be controlled explicitly.
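As a concrete illustration of kernel selection, the following minimal sketch compares the kernels listed above using scikit-learn's SVC on a synthetic dataset; the dataset, parameter values and cross-validation setup are illustrative assumptions, not the configuration used by any of the cited works.

```python
# Sketch: comparing common SVM kernel functions with scikit-learn's SVC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

kernels = {
    "linear":  SVC(kernel="linear", C=1.0),
    "poly":    SVC(kernel="poly", degree=3, C=1.0),      # inhomogeneous polynomial
    "rbf":     SVC(kernel="rbf", gamma="scale", C=1.0),  # Gaussian radial basis function
    "sigmoid": SVC(kernel="sigmoid", C=1.0),             # hyperbolic tangent kernel
}

for name, clf in kernels.items():
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:8s} kernel: mean CV accuracy = {score:.3f}")
```

In practice the best-scoring kernel, together with its C (and degree or gamma) values, would be chosen through such a comparison on the task at hand.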
k-nearest neighbor classification approach: The k-nearest neighbor (KNN) classification method is an instance-based learning algorithm that categorizes objects based on the closest training examples in the feature space (Han et al., 1999; Osuna, 2002). The training data is mapped into a multi-dimensional feature space. The feature space is partitioned into regions based on the categories of the training set. A point in the feature space is assigned to a particular category if it is the most frequent category among the k nearest training data. During the classifying stage, the KNN classification approach finds the k closest labeled training samples for an unlabeled input sample and assigns the input sample to the category that appears most frequently within the k-subset. KNN stands out from other classification approaches by its simplicity: it only requires a training set (which can be small), an integer specifying k and a metric to measure closeness (Osuna, 2002).
Fig. 4: Feature space of a 3-dimensional KNN classifier (dots, blue triangles and squares represent data points from ω1, ω2 and ω3, respectively; Xu represents an unlabeled input sample) (Osuna, 2002)
Figure 4 shows an example of the feature space of a KNN classifier
with three categories, where ω1, ω2 and ω3
represent three different categories with associated training samples
and dots, blue triangles and squares represent data points from ω1,
ω2 and ω3, respectively. Xu represents
an unlabeled input sample to be classified (Osuna, 2002).
Euclidean distance is typically used in computing the distance between the vectors. The key element of this method is the availability of a similarity measure for identifying neighbors of a particular document (Han et al., 1999). The training phase consists only of storing the feature vectors and categories of the training set. In the classifying phase, distances between the input sample and all stored vectors are computed and the k closest samples are selected. The annotated category of the input sample is then predicted based on these nearest points and the categories to which they have been assigned. In the case of N dimensions, the Euclidean distance between two points p and q is computed using the formula illustrated in Eq. 10:

d(p, q) = √((p1 - q1)² + (p2 - q2)² + … + (pN - qN)²)      (10)

where pi (or qi) is the coordinate of p (or q) in dimension i.
Based on the example as shown in Fig. 4, the classification task is to annotate an input sample, Xu to its right category. In this case, the value of k has been assigned as 5 and Euclidean distance is used to compute the distance between Xu to the training samples in the feature space. Of the 5 closest neighbors, 4 belong to ω1 and 1 belongs to ω3. As the result of KNN classification, Xu is annotated to the category of ω1, which is the pre-dominant category (Osuna, 2002).
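The worked example above can be reproduced in a few lines of code. The following minimal sketch implements the Euclidean distance of Eq. 10 together with the k = 5 majority vote; the sample coordinates and labels are illustrative assumptions mirroring Fig. 4, not data from the cited study.

```python
# Sketch: KNN classification with Euclidean distance (Eq. 10) and majority voting.
from collections import Counter
import numpy as np

def knn_classify(X_train, y_train, x_u, k=5):
    """Assign the unlabeled sample x_u to the majority category of its k neighbors."""
    dists = np.sqrt(((X_train - x_u) ** 2).sum(axis=1))  # Eq. 10 for every training point
    nearest = np.argsort(dists)[:k]                       # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training points: four from w1, two from w3.
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.3], [1.1, 1.0], [5.0, 5.2], [5.1, 4.9]])
y_train = np.array(["w1", "w1", "w1", "w1", "w3", "w3"])
print(knn_classify(X_train, y_train, np.array([1.4, 1.2]), k=5))  # predominant category wins
```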
The KNN classification approach stands out for its simplicity. It performs well even in classification tasks with multi-categorized documents. The major drawback of KNN is that it uses all features in the distance computation, which makes the method computationally intensive, especially as the size of the training set grows. Besides this, the accuracy of k-nearest neighbor classification is severely degraded by the presence of noisy or irrelevant features, especially when the number of attributes grows.
As KNN classification is an instance-based, or lazy, learning approach, the algorithm postpones processing the training data until it receives a command to classify an input sample. When the classifier receives an input sample to be classified, it consults all the training samples in order to answer the classification request and then discards the constructed answer and any intermediate results (Osuna, 2002). On the other hand, eager learning algorithms, such as artificial neural networks, build the classification model during the training stage by compiling the training data and discard the training data after the classification model has been built. The resulting model is then used to classify input samples and is retained for future classification requests (Osuna, 2002). By its nature as an instance-based learning approach, KNN classification consumes less processing time than eager learning algorithms during the training stage. However, KNN requires greater processor and physical memory usage and higher time consumption during the classifying stage.
There exist two classical algorithms for enhancing the KNN classification approach by reducing the time consumption and the processor and physical memory usage during the classifying stage: the bucketing algorithm and the k-dimensional trees algorithm (Osuna, 2002). The bucketing algorithm was introduced by Welch in 1971, while the k-dimensional trees algorithm is described by Osuna (2002).
In the bucketing algorithm, in order to improve the efficiency of the conventional KNN classification approach, the feature space is segmented into identical cells. Each individual cell contains training data points, which are recorded and stored in a list (Osuna, 2002). During the classifying stage, the cells are examined in order of increasing distance from the input data point and the distance between the input data point and the internal data points of each cell is computed.
Fig. 5: Bucketing algorithm for the KNN classification approach (Osuna, 2002)
The classifying process terminates when the distance from the input data point to the cell currently being examined is larger than the distance from the input data point to the nearest training data point found so far (Osuna, 2002). The classification result is obtained based on the cell which has the shortest distance to the input data point (Osuna, 2002). Figure 5 shows an illustration of an example of the bucketing algorithm.
The k-dimensional trees algorithm for KNN classification uses a binary search tree generalized to high dimensions (Osuna, 2002). It is a space-partitioning data structure that organizes data points in a k-dimensional feature space. The training stage starts with a partitioning process that partitions the nodes of a k-dimensional tree in order to segment the multi-dimensional feature space according to the underlying distribution of the training data (Osuna, 2002). Each node of a k-dimensional tree is associated with a hyper-rectangle and a hyper-plane orthogonal to one of the coordinate axes (Osuna, 2002). The hyper-rectangle of each node is divided by the hyper-plane into two parts and these two parts are associated with the successors of the node. The training stage terminates when the number of training data points in a hyper-rectangle is less than a given threshold (Osuna, 2002). During the classifying stage, an input data point is mapped into one of the nodes of the k-dimensional tree. The algorithm searches the tree downwards to find all the training data points included in the node that contains the input data point (Osuna, 2002). Afterwards, the algorithm examines the surrounding nodes to determine whether they overlap the ball centered at the input data point. The classification result is obtained based on the label of the training data point which has the shortest distance to the input data point (Osuna, 2002).
Fig. 6: Example of the data structure of a k-dimensional tree (3-D view) (Osuna, 2002)
Fig. 7: (a-e) Example of a k-dimensional tree decomposition (2-D view) (Osuna, 2002)
Figure 6 shows an example of the data structure of a k-dimensional tree in 3-dimensional view, while Fig. 7a-e show an example of a k-dimensional tree decomposition after the partitioning process in 2-dimensional view (Osuna, 2002).
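The speed-up offered by the k-dimensional tree can be demonstrated with an off-the-shelf implementation. The sketch below uses scipy's cKDTree to build the tree once during the training stage and to answer a k-nearest-neighbor query at classification time; the data, dimensionality and value of k are illustrative assumptions.

```python
# Sketch: accelerating the KNN search with a k-dimensional tree (scipy's cKDTree).
from collections import Counter
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))        # 3-dimensional feature space
y_train = rng.integers(0, 3, size=10_000)     # three categories

tree = cKDTree(X_train)                       # training stage: build the tree once
dist, idx = tree.query(rng.normal(size=3), k=5)   # classifying stage: 5 nearest points
label = Counter(y_train[i] for i in idx).most_common(1)[0][0]
print("predicted category:", label)
```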
NEAREST NEIGHBOR-SUPPORT VECTOR MACHINES HYBRID CLASSIFICATION MODELS
SVM-KNN: Discriminative nearest neighbor classification for visual category recognition by Zhang et al. (2006): Zhang et al. (2006) proposed an integration of the nearest neighbor classifier and the support vector machine to classify visual objects, coined SVM-KNN. The proposed SVM-KNN hybrid classification model can be used for large and multiclass datasets with reasonably low computational complexity in both the training and classifying stages. This property of lower computational complexity is inherited from the KNN classification approach, which does not need the construction of a feature space and is able to handle and manage multiclass datasets with high dimensionality. Due to this fact, the proposed SVM-KNN model is able to provide a more flexible framework from a computer vision perspective, as the emphasis is on the similarity of objects (Zhang et al., 2006).
The NN algorithm is used by the SVM-KNN hybrid classification approach as the initial process of its classification task. When there is no conflict in the classification of a given input sample, i.e., all k neighbors are from the same class, a straightforward classification is performed by the simple NN algorithm without involving the SVM approach. On the other hand, if the k neighbors are not all of the same class, the hybrid classification approach is used to perform the classification task (Zhang et al., 2006). In the proposed SVM-KNN hybrid classification approach, the Nearest Neighbor (NN) algorithm is utilized to prune the original training set, so that the SVM which acts as the classification engine of this hybrid model does not need to be trained with the original training set, but with a pruned training set from which insufficient training samples have been eliminated. This ensures a faster training process for the SVM with a smaller training set, without sacrificing classification performance. For the pruned training set, pair-wise distances between the training samples are computed and the distance matrix is converted into a kernel matrix, which serves as the training data for the SVM (Zhang et al., 2006). The general idea of this work is to train the SVM so that it preserves the distance function on the set of neighbors and to find close neighbors to an input sample. The algorithm of the proposed SVM-KNN is summarized in Table 1. While performing the task of classifying visual objects, SVM-KNN focuses on two major elements of visual object recognition: shape and texture. Several successful distance functions have been used in this study, such as the χ² distance, marginal distance, tangent distance, shape context based distance, geometric blur based distance (L2) and kernelized distances (Zhang et al., 2006).
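The following sketch illustrates the SVM-KNN idea described above: easy queries are answered by plain NN, while a local SVM is trained on the k neighbors only when their labels disagree. The Euclidean distance and the RBF-style conversion of the distance matrix into a kernel matrix are simplifying assumptions; the original work uses task-specific distance functions such as shape context or geometric blur.

```python
# Sketch: SVM-KNN in the spirit of Zhang et al. (2006), under simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

def svm_knn_predict(X_train, y_train, x_query, k=10, gamma=1.0):
    d = np.linalg.norm(X_train - x_query, axis=1)     # distances to all training samples
    nn = np.argsort(d)[:k]                            # indices of the k nearest neighbors
    labels = y_train[nn]
    if np.all(labels == labels[0]):                   # no conflict: plain NN answer
        return labels[0]
    # Conflict: convert pairwise distances among the neighbors into a kernel matrix
    Xk = X_train[nn]
    D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
    K = np.exp(-gamma * D ** 2)                       # distance matrix -> kernel matrix
    svm = SVC(kernel="precomputed").fit(K, labels)    # local SVM on the pruned set
    d_query = np.linalg.norm(Xk - x_query, axis=1)
    K_query = np.exp(-gamma * d_query ** 2)[None, :]  # kernel row for the query sample
    return svm.predict(K_query)[0]
```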
Table 2: Summary of experimental results of the SVM-KNN hybrid classification model on benchmark datasets: error rate for each distance function
According to the authors, their research is very similar to local learning by Bottou and Vapnik, which uses the same idea but a different distance formula (Zhang et al., 2006). The experiments and evaluations of the proposed SVM-KNN hybrid classification model on some standard benchmark datasets indicated that the SVM-KNN hybrid model performed relatively well on the MNIST, USPS, CUReT and Caltech-101 datasets, as compared to the SVM and KNN classifiers alone, as well as the DAGSVM and HKNN classifiers (Zhang et al., 2006). The MNIST dataset contains 60,000 training samples and 10,000 testing samples. During the testing stage, geometric blur based distance, L2 and shape context distance were used to compare SVM-KNN with NN alone. The USPS dataset consists of 7291 training samples and 2007 testing samples. Two types of distance, L2 and tangent distance, were used in the comparison between SVM-KNN, NN, DAGSVM and HKNN. For the CUReT dataset, training data and testing data are chosen randomly from 61 real world textures. Based on the experiments conducted on the datasets mentioned above, the results show that SVM-KNN obtained higher classification accuracy as compared to NN and DAGSVM.
The Caltech-101 is the most challenging dataset among all the tested datasets because it contains many poses, lighting conditions and variations of colors (Zhang et al., 2006). The formula for the distance calculation used on Caltech-101 is defined in Eq. 11 and 12:
With reference to Eq. 11 and 12, DA(IL, IR) is the distance between a left image IL and a right image IR; the remaining terms denote the ith geometric blur feature of the left and right image, respectively, and the outputs of the left image for the corresponding filters (Zhang et al., 2006). From the experimental results, the authors found that their proposed SVM-KNN hybrid classification model has higher classification accuracy as compared to NN and DAGSVM.
The experimental results illustrated in Table 2 show that the SVM-KNN hybrid model has the lowest error rates compared to the classification methods mentioned above. Figure 8 shows that the proposed SVM-KNN hybrid classification model has a trade-off between time consumption and classification error rate: the classification error rate is high when the time consumption is low and vice versa. According to the researchers of this work, the SVM-KNN hybrid model can overcome the drawback of conventional SVM, which is less efficient especially when the problem domain becomes intractable (Zhang et al., 2006).
Gene expression data classification using SVM-KNN classifier by Shen and Lin (2004): Shen and Lin (2004) presented a new binary classifier that combines Support Vector Machines (SVM) with the k-nearest neighbor (KNN) algorithm for gene expression data classification, in order to improve disease diagnosis. Conventional SVM classification produces lower classification accuracy if the sample data of the two classes in a binary classification are all close to the separating hyper plane. When conventional KNN classification is applied to gene expression data classification, KNN distance functions are expected to give poor classification performance as the dimensionality of the noisy data increases (Shen and Lin, 2004).
Fig. 8: Trade-off between speed and accuracy of SVM-KNN in the case of texture classification (Zhang et al., 2006)
Fig. 9: Feature space with training samples of SVM with three misclassified samples (two white rectangles and one white triangle) (Shen and Lin, 2004)
The KNN and SVM hybrid classification system proposed in this work, coined KSVM, was developed because conventional SVM is not able to handle the situation in which training samples lie close to the optimal separating hyper plane. In such a situation, samples may be misplaced into the other class, as illustrated in Fig. 9. In Fig. 9, black triangles represent data points from category AML while black rectangles represent data points from category ALL; the two white rectangles and the white triangle are the data points which have been misclassified.
Fig. 10: Distance from sample x to the hyper plane determines which algorithm should be used to classify x (Shen and Lin, 2004)
During the classification phase, the KSVM algorithm calculates the distance from the testing sample to the optimal separating hyper plane in feature space. NN classification is a distance based approach in which Euclidean distance is typically used to calculate the distance between sample data. However, in the proposed KSVM method, the distance between sample data is computed using Eq. 13:
According to the algorithm of the KSVM approach, if this distance is greater than a given threshold, the input sample is classified using SVM; otherwise classification is performed using KNN. The classification algorithm of the proposed KSVM approach is shown in Table 3 and Fig. 10.
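A minimal sketch of this decision rule is given below, assuming binary SVM and KNN classifiers from scikit-learn that have already been fitted on the same training data; the use of decision_function as the (margin-scaled) distance to the hyper plane and the threshold value are simplifying assumptions, not the exact Eq. 13 of the paper.

```python
# Sketch: KSVM-style decision rule (far from the hyper plane -> SVM, near it -> KNN).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def ksvm_predict(svm: SVC, knn: KNeighborsClassifier, x, epsilon=0.5):
    """x is a 1-D numpy feature vector; svm is assumed to be a fitted binary SVC."""
    distance = abs(svm.decision_function(x.reshape(1, -1))[0])  # distance to hyper plane
    if distance > epsilon:
        return svm.predict(x.reshape(1, -1))[0]   # confident region: use SVM
    return knn.predict(x.reshape(1, -1))[0]       # near the boundary: fall back to KNN
```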
Shen and Lin (2004) compared the performance of KSVM with KNN and SVM individually in the experiments conducted in this research work. The Leukemia dataset and the Colon dataset were used to conduct the experiments in order to evaluate the performance of their proposed KSVM classification approach. The Leukemia dataset consists of 38 training samples and 35 testing samples, while the Colon dataset has 31 samples each for the training set and the testing set. According to Shen and Lin (2004), two feature selection methods exist for the gene selection problem: signal-to-noise (S2N) and Recursive Feature Elimination (RFE) (Shen and Lin, 2004). In the experiments conducted in this work, they selected the S2N feature selection technique and computed the statistic shown in Eq. 14:

S2N(j) = (μ+(j) - μ-(j)) / (σ+(j) + σ-(j))      (14)
The symbols σ+(j) and σ-(j) represent the standard deviations of the jth gene for the two classes, while μ+(j) and μ-(j) represent the mean expression values of the jth gene for classes +1 and -1, respectively. The experimental results obtained in this research work are illustrated in Table 4.
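The S2N ranking of Eq. 14 can be sketched in a few lines; keeping the genes with the largest absolute scores is an assumption about how the statistic is applied, not necessarily the authors' exact procedure.

```python
# Sketch: signal-to-noise (S2N) gene ranking as in Eq. 14.
import numpy as np

def s2n_scores(X, y):
    """X: samples x genes expression matrix; y: class labels in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    return (mu_p - mu_n) / (sd_p + sd_n)        # Eq. 14, one score per gene

def select_top_genes(X, y, n_genes=50):
    scores = np.abs(s2n_scores(X, y))
    return np.argsort(scores)[::-1][:n_genes]   # indices of the most informative genes
```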
According to the experimental results obtained by Shen and Lin (2004), the KSVM classifier produced higher accuracy as compared to conventional SVM and conventional KNN. Based on their findings, the better classification performance of the KSVM classifier over the conventional SVM classifier is due to the fact that more information is carried over after the training process is performed, as compared to SVM. Besides this, the number of genes used in the training process has less effect on the KSVM classifier than on the two latter classifiers (Shen and Lin, 2004).
Hybridized KNN and SVM for gene expression data classification by Mei et al. (2009): Mei et al. (2009) proposed a hybridized algorithm that uses both k-nearest neighbor (KNN) and Support Vector Machine (SVM) classifiers, coined HKNNSVM, to analyze gene expression data for cancer classification, in both binary and multi-class categorization. They mentioned that precise diagnosis and classification is the ultimate goal for successful treatment of illness.
Table 5: Steps of the HKNNSVM hybrid algorithm in the training stage

Table 6: Steps of the HKNNSVM hybrid algorithm in the testing stage
The proposed HKNNSVM was developed in order to improve the performance of conventional SVM and to solve the problem of overlapped boundary regions (Mei et al., 2009).
The flow chart of the proposed HKNNSVM algorithm is depicted in Fig. 11. KNN is used as a preliminary classifier to prune the training samples and effectively eliminate insufficient samples in order to reduce the training size for better classification. The second KNN classifier, coupled with SVM, is used to classify the input samples based on the remaining training samples which were not pruned away by the first KNN. Both KNN classifiers use the Euclidean distance formula to calculate the distances between each pair of samples (Mei et al., 2009). The steps of the training process of HKNNSVM are listed in Table 5 and the steps of the classifying process are shown in Table 6.
The KNN algorithm is used to reduce the number of training samples by computing the distance matrix (a symmetrical matrix containing the Euclidean distance between each pair of samples). Then, the k nearest neighbors of each sample are sought. If the category of a training sample is the same as the label of the majority of its k nearest neighbors, the training sample is reserved; otherwise it is eliminated. As the outcome of the pruning process, a reduced training set is obtained, which is then used to train the KNN classifier for further classification. During the classifying stage performed by the KNN classifier, if the k neighbors of the input sample are all in the same category, the input sample is assigned to that particular category. On the other hand, if the k neighbors of the input sample are not all from the same category, the hybrid classification approach HKNNSVM calculates the pair-wise distances between the k neighbors and uses the computed distance matrix as the input to the SVM. After the classification process performed by the SVM, the input sample is assigned the category resulting from the SVM classification.
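The training-stage pruning and the two-stage classification described above can be sketched as follows; Euclidean distances, the RBF-style kernel conversion and integer-coded class labels are simplifying assumptions rather than the authors' exact implementation.

```python
# Sketch: HKNNSVM-style pruning plus KNN/SVM classification, under simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

def prune_training_set(X, y, k=5):
    """Keep only samples whose label matches the majority of their k nearest neighbors.
    Labels are assumed to be small non-negative integer codes."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]              # skip the sample itself
        if y[i] == np.bincount(y[nn]).argmax():  # reserve only consistent samples
            keep.append(i)
    return X[keep], y[keep]

def hknnsvm_predict(Xp, yp, x_query, k=5, gamma=1.0):
    d = np.linalg.norm(Xp - x_query, axis=1)
    nn = np.argsort(d)[:k]
    if np.all(yp[nn] == yp[nn[0]]):              # unanimous neighbors: done
        return yp[nn[0]]
    Xk = Xp[nn]
    D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
    svm = SVC(kernel="precomputed").fit(np.exp(-gamma * D ** 2), yp[nn])
    kq = np.exp(-gamma * np.linalg.norm(Xk - x_query, axis=1) ** 2)[None, :]
    return svm.predict(kq)[0]
```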
Three benchmark datasets for gene expression classification, namely the Colon dataset, the Estrogen dataset and the Acute Lymphoblastic Leukemia dataset, were used by the authors to conduct the experiments for evaluating their proposed HKNNSVM classification approach. The Colon dataset consists of 50 training samples and 12 testing samples, while the Estrogen dataset has 40 training samples and 20 testing samples. For the Acute Lymphoblastic Leukemia dataset, the authors selected 208 training samples and 40 testing samples. Based on their experimental results, the authors claimed that HKNNSVM obtained the lowest classification error rate as compared to SVM, KNN and NN-SVM used independently from one another (Mei et al., 2009). Table 7 summarizes the experimental results for each dataset that contained mislabeled samples. The misclassification rate on the testing set was found to be lower with sample pruning than that of KNN and SVM, which indicates the stability of the classification performance of the HKNNSVM algorithm. From the experimental results, the authors conclude that HKNNSVM can perform better than SVM, KNN and NN-SVM, owing to its pruning function, which can effectively eliminate mislabeled training samples.
Instance-based spam filtering using SVM nearest neighbor classifier by Blanzieri and Bryl (2007a): Enrico Blanzieri and Anton Bryl presented a hybrid classification model, coined SVM-NN, which combines the Support Vector Machines (SVM) classification approach and the Nearest Neighbor (NN) classification approach, and evaluated the performance of SVM-NN in handling the task of spam filtering. The proposed SVM-NN classifier is a combination of an SVM-based decision rule and k-nearest neighbor (KNN) classifiers. The aim of this research work is to evaluate the performance of the SVM-NN algorithm in filtering unsolicited bulk email, as spam has been one of the serious problems of the Internet for decades and has no proper uniformity, containing messages that differ greatly in terms of topic, format and content (Blanzieri and Bryl, 2007a).
The proposed SVM-NN hybrid classification model leads to a bigger margin and hence a lower generalization error bound. A sample is first handled using the KNN classification approach; if it cannot be categorized this way, it is then classified using SVM. In the experiments, the Euclidean metric is used to determine the nearest neighbors and the linear kernel is used for SVM. The steps of this hybrid algorithm are shown in Table 8.
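A hedged sketch of these steps is given below: for every message to be classified, a linear SVM is trained only on the k nearest training messages and that local model supplies the label. The unanimity shortcut (skipping the SVM when all neighbors agree) and the parameter values are practical assumptions, not necessarily the authors' exact procedure.

```python
# Sketch: SVM-NN in the spirit of Blanzieri and Bryl (2007a), under simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

def svm_nn_predict(X_train, y_train, x_msg, k=30):
    d = np.linalg.norm(X_train - x_msg, axis=1)   # Euclidean metric for neighbor search
    nn = np.argsort(d)[:k]                        # k nearest training messages
    if len(set(y_train[nn])) == 1:                # all neighbors share one label
        return y_train[nn][0]                     # (SVM training needs two classes)
    local_svm = SVC(kernel="linear").fit(X_train[nn], y_train[nn])
    return local_svm.predict(x_msg.reshape(1, -1))[0]
```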
In order to classify a sample, the SVM-NN hybrid classifier finds the k nearest samples in the training set and an SVM classifier is trained on these k samples. During the classifying stage, the trained SVM classifier is used to classify the unlabeled input sample. The combination of SVM and KNN is able to obtain a lower generalization error bound, produced by the combination of a greater margin and a smaller ball containing the points (Blanzieri and Bryl, 2007a). A comparison experiment was conducted in order to evaluate the performance of the proposed SVM-NN classifier, the conventional KNN classifier and the conventional SVM classifier. The experiments were carried out by ten-fold cross-validation on the SpamAssassin corpus, as a reproduction of one of the experiments by another research group (Blanzieri and Bryl, 2007a).
Table 7: Misclassification rates for datasets containing mislabeled samples (Mei et al., 2009)
The SpamAssassin corpus contains 4150 legitimate and 1897 spam messages. Two variants of the proposed hybrid classification model were involved in the comparison: SVM-NN (full scale) and SVM-NN (partial scale). The Total Cost Ratio (TCR) is used for parameter optimization and is defined in Eq. 15:

TCR = nS / (λ nL→S + nS→L)      (15)
where nS represents the total number of spam messages in the corpus, nL→S is the number of legitimate messages misclassified as spam, nS→L is the number of spam messages misclassified as legitimate and λ is the relative cost of misclassification.
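As a small worked illustration of Eq. 15, the helper below computes TCR from the three counts and λ; it assumes the standard definition from the spam filtering literature as reconstructed above.

```python
# Sketch: Total Cost Ratio (Eq. 15); higher values indicate a more useful filter.
def total_cost_ratio(n_spam, n_legit_to_spam, n_spam_to_legit, lam=9):
    return n_spam / (lam * n_legit_to_spam + n_spam_to_legit)

print(total_cost_ratio(n_spam=1897, n_legit_to_spam=5, n_spam_to_legit=60))  # example counts
```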
From the experiments conducted by the authors in this work, it can be seen that SVM-NN generally performs better than conventional SVM. SVM-NN outperforms SVM in terms of error rates, having a lower false negative rate (FN) as well as a lower false positive rate (FP). However, SVM-NN only outperforms SVM significantly in low dimensional feature spaces; the improvement is not significant in higher dimensional feature spaces. The relation between FN and the dimensionality of the feature space is shown in Fig. 12, while Fig. 13 illustrates the relation between FP and the dimensionality of the feature space. Based on the experimental results, the authors conclude that the major drawback of SVM-NN is its high resource consumption, which results in low processing speed and a costly classification process, especially as k increases. In high dimensional feature spaces, the advantage of SVM-NN over SVM is less significant, but it persists. This might be due to the fact that the SVM-NN approach has higher sensitivity to unrelated features, as compared to SVM (Blanzieri and Bryl, 2007a).
Evaluation of the highest probability SVM nearest neighbor classifier with variable relative error cost by Blanzieri and Bryl (2007b): Blanzieri and Bryl (2007b) conducted research work to investigate the performance of combining two classifiers, Support Vector Machines (SVM) and k-nearest neighbor (KNN), to perform the task of spam filtering.
Unsolicited bulk email, or e-mail spam, causes huge financial losses due to the misuse of time, storage and resources to promote online scams, fraud and illegal goods. Hence, spam filtering is widely applied to overcome the issue of email spamming. According to Enrico Blanzieri and Anton Bryl, there exist several methods of spam filtering, such as SMTP path analysis, human language technologies, Naïve Bayes, SVM and KNN.
The authors proposed an integration called highest probability-support vector machine-nearest neighbor, coined HP-SVM-NN, as a classification method for spam filtering. The proposed classification approach is able to outperform the conventional SVM classifier and achieves a lower generalization error bound as compared to conventional SVM (Blanzieri and Bryl, 2007b). The authors combined the algorithms of SVM and KNN by first selecting the parameter k for each sample x from a pre-defined set, based on the neighborhood of the sample to be classified. The hybrid algorithm then trains an SVM model on the k nearest labeled samples and uses the model to classify the given sample. This hybrid model classifies the input data using the probabilities P(legitimate|x) and P(spam|x) = 1 - P(legitimate|x). The output of the SVM model is used to estimate these probabilities. The highest probability value is used to calculate the error for a negative answer and the error for a positive answer. The pseudo code of the HP-SVM-NN algorithm is listed in Table 9.
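A hedged sketch of this procedure is shown below: a local SVM is trained on the k nearest labeled messages and its probability output estimates P(spam|x). The use of scikit-learn's probability=True option (Platt scaling) stands in for the authors' own probability estimation and the label encoding is an assumption.

```python
# Sketch: HP-SVM-NN-style local SVM with probability output, under simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

def hp_svm_nn_spam_probability(X_train, y_train, x_msg, k=30):
    """y_train is assumed to use label 1 for spam and 0 for legitimate messages."""
    nn = np.argsort(np.linalg.norm(X_train - x_msg, axis=1))[:k]
    if len(set(y_train[nn])) == 1:                  # neighbors unanimous
        return float(y_train[nn][0] == 1)
    svm = SVC(kernel="linear", probability=True).fit(X_train[nn], y_train[nn])
    spam_col = list(svm.classes_).index(1)
    return svm.predict_proba(x_msg.reshape(1, -1))[0, spam_col]   # P(spam | x)
```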
In order to evaluate the performance of the HP-SVM-NN hybrid classification approach, the authors used the SpamAssassin corpus, which consists of 4150 legitimate messages and 1897 spam messages. Messages are broken into message header and message body by a feature extraction method (Blanzieri and Bryl, 2007b). Each message is represented by a vector of d binary features. The d binary features with the highest information gain are selected using the formula illustrated in Eq. 16:
where c1 and c2 represent the two classes of samples and fk represents a binary feature.
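Information gain for binary features can be computed as the mutual information between each feature and the class label. The sketch below uses this standard form as an assumption about Eq. 16 and returns the indices of the d most informative features.

```python
# Sketch: ranking binary features by information gain (mutual information form).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_d_features(X_binary, y, d=100):
    """X_binary: messages x binary features matrix; y: class labels (c1 / c2)."""
    ig = mutual_info_classif(X_binary, y, discrete_features=True)
    return np.argsort(ig)[::-1][:d]   # indices of the d most informative features
```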
In the experiments conducted by the authors, the performance of HP-SVM-NN was compared to the performance of SVM in terms of Receiver Operating Characteristic (ROC) curves. The comparison shows that HP-SVM-NN outperforms the conventional SVM, as illustrated in Fig. 14. However, the major drawback of the HP-SVM-NN classifier is its low processing speed, due to the fact that it needs to calculate distances for each of the sample data; it thus increases classification accuracy at the price of being computationally intensive (Blanzieri and Bryl, 2007b). The authors suggested building a filter by cascading the model with a faster algorithm such as Naïve Bayes to classify a message, with the HP-SVM-NN method used only for the final checking of those messages labeled as spam by Naïve Bayes (Blanzieri and Bryl, 2007b).
LDA/SVM driven nearest neighbor classification by Peng et al. (2003): Peng et al. (2003) proposed a locally adaptive neighborhood morphing classification method to reduce the bias caused by the finite samples of Nearest Neighbors (NN) under the curse of high dimensionality. Such severe bias can arise in the NN rule and, if not properly dealt with, can strongly affect classification performance. Local Support Vector Machine (SVM) learning is applied to produce neighborhoods that are elongated along less relevant feature dimensions and constricted along the most influential ones (Peng et al., 2003). As a result, the modification of the neighborhood improves classification performance while still maintaining the stability of the class conditional probabilities.
The neighborhood morphing NN algorithm, named MORF, has three stages, as described in Table 10. First, it uses the Euclidean distance function to compute a nearest neighborhood of points around the query.
Table 10: Steps of the neighborhood morphing NN algorithm (MORF)

Table 11: Average classification error rates of competing classification schemes on different datasets (Peng et al., 2003)
Then, a local linear SVM is fitted to obtain an exponential feature weighting scheme. Finally, the resulting weighting scheme is used to compute the nearest neighbors for classification (Peng et al., 2003).
The authors derive a neighborhood morphing algorithm with SVM to produce a nearest neighbor classifier that performs better than conventional SVM. They propose using WSVM, where W denotes the within sum-of-squares matrix, to compute the feature relevance ri with the formula illustrated in Eq. 17, in which Θ is the direction whose inner product with WSVM is taken.
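The three MORF stages can be sketched as follows, under simplifying assumptions: the per-feature relevance is taken from the absolute weights of a local LinearSVC rather than from the exact Eq. 17 and the exponential weighting constant c is illustrative.

```python
# Sketch: MORF-style neighborhood morphing (Euclidean neighborhood -> local linear SVM
# -> exponential feature weights -> weighted KNN vote), under simplifying assumptions.
from collections import Counter
import numpy as np
from sklearn.svm import LinearSVC

def morf_predict(X_train, y_train, x_query, k0=50, k=5, c=2.0):
    # Stage 1: Euclidean neighborhood around the query (assumed to contain >= 2 classes)
    nn0 = np.argsort(np.linalg.norm(X_train - x_query, axis=1))[:k0]
    # Stage 2: local linear SVM -> exponential feature weighting scheme
    svm = LinearSVC().fit(X_train[nn0], y_train[nn0])
    r = np.abs(svm.coef_).mean(axis=0)                 # per-feature relevance (assumed proxy)
    w = np.exp(c * r) / np.exp(c * r).sum()            # exponential weights
    # Stage 3: weighted Euclidean KNN for the final decision
    d = np.sqrt(((X_train[nn0] - x_query) ** 2 * w).sum(axis=1))
    nn = np.argsort(d)[:k]
    return Counter(y_train[nn0][nn]).most_common(1)[0][0]
```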
The effectiveness of their proposed algorithm was compared against other classification techniques using ten different datasets, namely Iris, Vote, Sonar, Ion, Liver, Hep, Cancer, Pima, OQ and Unstruct. The competing classification methods, namely MORF, LDAW, SVM-R, KNN, C4.5, MACHETE, SCYTHE and DANN, were carefully chosen and compared to determine the usefulness of the proposed method. The experimental results are shown in Table 11 and the performance distribution in Fig. 15.
Based on the experimental results illustrated in Table 11 and Fig. 15, the authors conclude that the MORF classification model outperforms the other classifiers. The experimental results showed that KNN had the worst classification performance, while MORF, the hybridized algorithm proposed in this work, obtained the best results. MORF can minimize the bias of NN, whose assumption is that the class conditional probabilities are locally static and constant (Peng et al., 2003). While building local linear SVMs for classification is usually costly, the increase in classification performance over KNN methods outweighs this additional cost.
Fig. 15: Performance distribution of competing classification schemes on different datasets (Peng et al., 2003)
The experimental results showed clearly that the MORF algorithm can potentially improve classification performance over KNN and SVM-R used independently from each other.
DISCUSSION AND CONCLUSION
A comprehensive review of Support Vector Machine (SVM), k-nearest neighbor (KNN) and several SVM-KNN hybrid classification models has been presented in this study. In our early investigation of different machine learning classification schemes, we found that both the KNN classification approach and the SVM classification approach have been widely implemented in many real world applications due to their good performance, which guarantees high classification effectiveness and efficiency. The KNN approach stands out for its simple training and classifying algorithms, due to the fact that KNN is an instance-based learning algorithm. Hence, it does not need construction of a feature space until it receives a command to classify an input sample. Besides this, it can manage multiclass datasets sufficiently while requiring only a small number of training samples. Despite that, most distance metrics, such as those used by nearest neighbor methods, are expected to become less reliable as the dimensionality of the noisy data increases, thus limiting the performance of NN when applied to data classification. In other words, the performance of the KNN classifier is severely degraded by noise. In many cases, KNN is used as a preliminary classifier to prune the training samples and effectively eliminate unreliable samples, thereby reducing the sample size for better classification.
In general, SVM has been commonly reported as one of the discriminative classification schemes recognized to be more accurate than many other classification approaches. The good classification performance of SVM is due to the fact that it implements the Structural Risk Minimization (SRM) principle, which entails finding a decision surface, called the optimal separating hyper-plane, that guarantees the lowest true error. However, the classification performance of SVM is not very high when samples are close to the hyper-plane. Samples intermixed with another class or lying in the overlapped boundary region may make the decision boundary overly complex and may be harmful to the precision of SVM. Another issue of the SVM classifier involves time-consuming optimization and the computation of pair-wise distances. The SVM classifier suffers from being computationally intensive due to its convoluted training and classifying processes, especially when the number of training samples and the number of features per sample are huge.
Thus, in order to improve classification performance, many recent researchers have focused their attention on combining the SVM and KNN methods to classify various types of high dimensional data (in terms of both the number of training samples and the number of features), such as gene expression data, disease data, visual objects and email spam.
The hybridized models bring a number of benefits. In the cases reviewed in this study, the hybridized models can be used for large and multiclass datasets, have reasonably lower computational complexity in training and at run time and achieve better classification accuracy. In such hybridized models, KNN is typically used to reduce the size of the training set. Hybrid models that combine SVM with KNN can perform better and faster, since the SVM is not trained with the entire training set. Also, there is evidence that SVM-KNN hybrid models can reduce the bias caused by the finite samples of NN under the curse of high dimensionality; such severe bias can arise in the NN rule and, if not properly dealt with, can strongly affect classification performance. In almost every instance, if not all, it was concluded that the SVM-KNN classifier has higher accuracy compared to SVM and KNN used independently, with a few works attributing this to the additional information obtained after training is performed. Also, the number of classes used in the training process has less effect on this classifier than on most other kinds of classifiers. The misclassification rate for the prediction or testing set was found to be lower with sample pruning than that of KNN and SVM, which indicates the stability of the classification performance of the algorithm. However, the common major drawback of most SVM-KNN classifiers is low processing speed, due to the fact that they need to calculate distances for each of the sample data, thus increasing classification accuracy at the price of extra computation.
In conclusion, while building a hybridized SVM-KNN method for classification is usually costly, the increase in classification performance over conventional KNN and SVM methods outweighs this additional cost. If speed is not the primary concern, its classification performance makes it a promising choice of classifier due to its high stability in most applications.