The accuracy of intrusion detection system is subject to unstable and time-consuming
detection algorithms. In recent years, intrusion detection technology can be
divided into two categories: intellectualization (Macia-Perez
et al., 2011) and distribution (Rehak et al.,
2009). With the development of electronic computer technology, researches
of the anomaly detection methods (Mabu et al., 2009)
play a much more significant role in intrusion detection systems. Based on statistical
learning theory, the intrusion detection system is capable of continuous learning
which enhances recognition rate and reliability.
In reality, when the network scale is large, the training time becomes too long to detect the incoming attacks. Besides of this main issue, the complexities of matrix operation will cause memory-hungry. In order to monitor traffic on a high-speed link continuously, a quick-response SVM classifier is needed. However, it is not practical for SVM to handle unlabeled data stream. For this reason, this study provides convenience and assistant of preprocessing and analyzing audit data for security auditor.
The difficulty in SVM-based IDS is that of optimizing the parameters from the training phase of the classifier. Because of this, an optimization algorithm based on grid search is proposed. By our use of the KDD CUP 99 data preprocessed as a source in these experiments, the related parameters were selected rapidly and appropriately. The test result showed that the classification accuracy ratio has been improved significantly.
PRINCIPLE OF SVM FOR MULTI-CLASSIFICATION
Support Vector Machine (SVM), a kind of machine learning algorithm, can efficiently
solve the classification issue. Since it can obtain better performance under
less sample training conditions, SVM possesses stronger generalized capability
(Keerthi et al., 2000). SVM had been widely used
in the small data sets and high dimension feature spaces due to its good applicability
and high efficiency, especially in abnormal intrusion detection.
SVM is a linear classification machine which was based on the premise that the problem is linearly separable. For two-class samples in training set, the linear separable function works perfectly if samples can be apart by a hyper-plane. But there are special circumstances which appears discrete samples.
Figure 1 indicates that it is impossible to separate the
white samples from the black ones which surround them by a straight line. The
classifier will fall into an infinite loop apparently. In order to obtain an
available optimal hyper-plane, the data should be logically mapped into a high-dimensional
feature space (Liejun et al., 2008). As Fig.
1 depicts, a dotted parabola is utilized to separate the two classes according
to the function value. Suppose that <x, z> is the inner product of original
features. While mapping to <Φ(x), Φ(z)>, the kernel function
can be defined as follows:
According to the Mercer Principle, different kernel function generates different classification machines.
|| The impartible data sample
RBF (Radial Basis Function) is one of the kernel functions and it can be defined as follows:
In related researches (Lijia et al., 2011; El-Kouatly
and Salman, 2008; Shuchun et al., 2011),
the RBF possess advantages of other kernel function for three reasons: Keerthi
proved that linear kernel, with one penalty parameter and RBF, with a penalty
parameter and a kernel parameter, can provide the same performance. Moreover,
it is hardly to distinguish the performance between Sigmoid kernel and RBF.
RBF is able to deal with samples in which relationship between Class Label and
feature is nonlinear.
Classification model might be severely affected by noise. An outlier will result in hyper plane dislocation. According to the largest interval law, the interval distance of hyper planes reduced by the deflected support vectors. To avoid this, the nonnegative Slack Variable ξi is introduced to largest interval and noise samples are partially allowed to maximize the interval distance. Here is the expression so-called soft margin separating hyper plane:
In expression 3, the target function value can be largely affected by parameter
C which is the weight of outliers. The larger the C value is, the bigger the
function value will be. Since the amount of outliers is controlled by accumulated
polynomial, most of the sample points obey the constraint. The corresponding
Lagrange formula can be defined as follows:
In expression 3, both αi and ri are Lagrange multipliers. The expressions of w and b can be derived using partial derivatives to parameters of this formula:
According to expression 6 and constraint expression 7, the anti-noise ability
can be improved and the affect of outliers can be reduced by looking for the
biggest interval. Aimed at the problem of multi-class classification, Hsu
and Lin (2002) approaches compared theoretical results with experimental
data. Some of the typical methods such as 1-vs 1, 1-vs-r and DAG-SVM are highly
effective. Centered on the thinking of two-class classification (Yang
et al., 2011), to obtain effective multi-class classifier, it is
necessary to pretreat the input data before training the classifier. After the
data preprocessing phase, parameter optimization will be the most important
part in this study.
DATA PREPROCESSING FOR INTRUSION DICTION MODEL
Security audit dataset, KDD 99, processed by Lee (Stolfo
et al., 2001) was considered as a noted dataset for intrusion detection
systems. Aiming at making evaluation (Muda et al.,
2011) on intrusion detection models, this audit dataset, provides TCP-dump-format
data recorded from a simulative local network, serve as training input to regulate
multi-class classifier based on SVM. There are 38 kinds of attacks in this text-format
file. Basic, content-based, 2-second-flow and host stream characteristics are
included in this dataset. Because of more than million records are included
in KDD dataset, 10% data are selected in this study.
The relative merits of data preprocessing are highly related to the precision
of classification (Gu et al., 2008). For some
unrecognized features are included in KDD dataset, such as the character data,
all features should be numeralized before the classification phase. For example,
the protocol type, of which the values are TCP, ICMP and UDP, are replaced by
3 integer values from 1 to 3. Similarly, numeralization replacement is implemented
for the third, fourth and forty-second features. The 40 sec feature can individually
serve for classification label. According to the rest of the dataset which includes
41 features, the metrics of features are different in standard. To avoid the
unbalanced data impact, a phenomenon that some of the smaller values are covered
by the larger values, it is essential to normalize the value of each attribute
from metrical to non-unit-related by the following formula:
In the expression 8 for normalization, ymax is range of the upper limit whereas ymin is range of the lower limit. xmax and xmin are maximum and minimum value of the original feature. With the input of x, the value from current record of the corresponding attribute, the output y is recorded as normalized value through calculation.
However, it is complicated for computer to calculate enormous matrix generated
from 10% KDD dataset after numeralization and normalization. To reduce memory
consumption in large scale computation, feature extraction (Al-Tameemi
et al., 2011) is utilized to accelerate convergence. According to
related research, rough set theory is proposed to extract KDD dataset which
can be indicated using the methods provided by ZAINA A (Zaina
et al., 2006). Six attributes are included in the extracted dataset:
3, 4, 5, 24, 32 and 33. To train the SVM multi-classifier, this experiment selected
2251 samples, including 1000 normal behaviors, 817 denial-of-service attacks,
23 probe behaviors, 37 user-to-root behaviors and 374 remote-to-local behaviors.
TRAINING SVM-BASED MULTI-CLASSIFIER WITH GS
Cross validation provides reliable basis for the evaluation and stability to
the trained multi-classifier. In this phase, SVM classification accuracy can
be greatly affected by the nuclear parameter σ in formula 2 and the error
penalty parameter C in expression 3. In order to obtain the best combination
of these parameters, an improved GA (genetic algorithm) was applied to parameter
optimization (Chen and Wang, 2008). To large-scale and
unbalanced datasets, such as KDD dataset, GA has provided a global optimal solution
depending on the fitness function. However, it is easy to get into the local
optimum combination because of the randomness of genetic variation algorithm.
To avoid the shortcoming of weak local search ability of traditional genetic
algorithm, a GS algorithm (Coarse-to-refined grid search) is proposed. The steps
are as follow:
||First, search at large range. Parameter C is selected from
a geometric series of which the range is 2-4 to 28
and the common ratio is 20.5. Parameter σ is selected from
a geometric series of which the range is 2-4 to 24
and the common ratio is 20.5. According to the best accuracy
in cross validation, the optimal combination of C and σ are
recorded as C* and σ*. The distribution map of parameters and accuracy
and three-dimensional one are shown in Fig. 2
||Distribution map of parameters and accuracy and that in three-dimensional
coordinate using rough GS; (a) Distribution map of parameters and accuracy
and (b) Distribution map of parameters and accuracy in three-dimensional
||Three-dimensional distribution map of parameters and accuracy
using refined GS
As Fig. 2a shows, the cross validation accuracies are distributed in the curves. The combination of C and σ is gradually converging to smaller areas.
||Then, search at small range according to the selected C* and σ*.
New parameter C is selected from a geometric series of which the range is
C*x2-1 to C*x21 and the common ratio is 20.125.
Similarly, new parameter σ is selected from a geometric series of which
the range is σ*x2-1 to σ*x2-1 and the common
ratio is 20.125. According to the best accuracy in cross validation,
new optimal combination of C and σ are recorded as C* and σ*.
The distribution map of parameters and accuracy and three-dimensional one
are shown in Fig. 3
As Fig. 2a shows, the distribution of cross validation accuracies are in close proximity to plain areas. In the process selecting C and σ, over-learning may arise (The multi-classifier runs with high accuracy in training dataset but poor accuracy in test dataset.). In order to avoid this phenomenon in highest plain area, minimum C is selected as best parameter C*, maximum σ is selected as best parameter σ*.
By using the 5% KDD dataset, selected and pretreated as test data, is utilized for classification. The Table 1 compares the SVM multi-classifier based on Grid Search Algorithm with that based on Genetic Algorithm.
Experimental result shows that the testing accuracy optimized by GS is better
then that optimized by GA and the best C and best σ selected by GS are
smaller than that selected by GA.
||The training time and cross-validation accuracy of GS and
GA during the parameter optimization phase
|| Grid search algorithm vs. Genetic algorithm
Although cross-validation accuracies are the same, the GS is advantageous
in both training and testing time. The training result is shown in Fig.
From the Table 1 and Fig. 4, we can see that the training time of GA-based multi-classifier is three times the training time of GS-based. So the intrusion detection model based on GS-SVM is an effective system, with faster reaction velocity, higher accuracy and stronger generalization.
This study focused on the One-Class SVM which was applied to the intrusion detection module. According to the scale of KDD dataset, the accuracy was strongly influenced by parameter C and σ. After sample numeralization and normalization, feature extraction based on rough set theoretical model was implemented to reduce memory consumption result from high dimensionality. In order to improve the generalization capacity of SVM, a parameter selection algorithm was proposed. Computation on simulation examples and comparison with genetic algorithm demonstrated that the GS doubled the speed of convergence in the cross validation phase. Moreover, the classification accuracy was improved by 0.2669%. For this reason, this paper recommended GS-SVM into intrusion detection system for the purpose of reducing the training time and enhancing the stability of the anomaly detection model.
This study is supported in part by the National Natural Science Foundation of China, Grant No. 60961002.