Abstract: Traditional spam filtering techniques based on email` s content used to implement text-related machine learning and classification. Nevertheless, the uncertainty of message`s content causes a performance bottle for machine`s discrimination and classification. Aiming at the deficiencies of traditional spam filtering methods, this study brings forward a vector supported classifier model based on the features of spammer`s behaviours. The evaluation results for real spam testing set show that the spam classifier based on features of spammer`s behaviour is capable of discriminating email`s type with pretty high accuracy. Meanwhile, the classifier model is of robust performance on noise data.
INTRODUCTION
Currently, common anti-spam techniques include key words filtering, black list or white list filtering, rule-based filtering, etc. Nevertheless, these methods mentioned above have many limitations and drawbacks such as existing high false positive ratio and false negative ratio, limited application conditions, no training or learning capability and so on (Zhen et al., 2005). Therefore, facing with more and more severe situation for anti-spam, to work on anti-spam algorithms with high performance is asked very urgently.
Figure 1 shows a spams life cycle model. According to this model, we induced that the key point to eliminate spam is how to destroy the step of spams delivery at server end or common users end. The most difficult point is how to distinguish spam on MTA or MUA correctly. As an information carrier, it is no difference for machine that an email is good or bad, real or faked.
Fig. 1: | Spams life cycle model |
As for human being, people usually depend on emails content to judge whether an email is spam or not. However, to identify an emails type based on content is often biased by each people. Lets have an example to illustrate this. Some popular commercial emails may be normal advisements and also may be faked ones in some cases. Those people who dont need them would take them as spam, but others who do need them would never mind about them. Therefore, to identify an emails type might be implemented easily by people but very hard for machine. Based on our former analysis, the subjective uncertainty to identify an emails type would bring great challenges to classifiers and filters which are based on emails content. No matter text cluster, content mining or other similar techniques, the common characteristic of these methods is to use key words or key sentences to discriminate emails type. Generally speaking, these key elements are not only appearing in spam but also appearing in legitimate ones sometimes. This characteristic would lead to filters false positive problem. On the other hand, some spammers who want to escape filtering would insert ASCII characters into key words or key sentence randomly. Such a content cheating trick is simple but effective which would lead to filters false negative problem. So, blur and redundant information which key words and key sentences might possess would be a big problem needed to solve in connection with content based filters. In Sarah Jane Delanys paper, she points out that concept drift existed universally in emails would result in big changes in the hidden context which can induce changes in the target concept (Delany and Ham, 2005). Obviously, concept drift may cause semantic uncertainty and incapability for key words learning under specific context. This problem would introduce further difficulties to content based filter. Another problem which content-based filter has to cope with is key words dictionarys refreshment and updating. As it is well known that content based filter would be out of date gradually alone with new key words emergence. So, the key words dictionary must be maintained and updated from time to time. Now, some anti-spam organizations like MAPS Spamhaus have provided key words stop list downloading service. China anti-spam organization, CCERT anti-spam team, also provides this kind of service (Ying, 2005). But the list requires users to maintain personally. Its not a very convenient work for every common user. In a word, to improve content based filters performance needs to overcome numerous drawbacks and bottlenecks.
This study tries to avoid many difficulties that content based filters have encountered and research on the spam categorization method from another point of view. We consider investigating the email categorization method based on spammers behaviors mode. After carefully model design and simulation tests, we found that the new categorization model is accurate and robust for spam discrimination.
BEHAVIOR FEATURES ANALYSIS
We collect a big volume of spammers behavior features from the Internet and organize them into a behavior features library. Among these behavior features, we divide them into three categories, namely e-mails time anomaly behaviors, e-mails sending anomaly behaviors and e-mails format anomaly behaviors. E-mails time anomaly behaviors represent delivering spam on an important date like holiday, the day of a virus out-breaking, etc. E-mails format anomaly behaviors are mainly concerned with the e-mail addresses like forged delivery address and over large carbon copy addresses, etc. e-mails format anomaly behaviors are related to e-mails layout and style like html formatted email body, null email title, etc. Table 1 shows part of these behaviors and corresponding instance.
According to RFC822 and the later revised RFC2822, Table 2 shows some message fields either in messages head or messages body mapped by those behaviours features shown in Table 1.
BEHAVIOR FEATURE SELECTION
To ensure online updatings efficiency for the discrimination boundary, choosing features vector with over-high dimensionality is not suitable, Meanwhile, in view of that spammers would change their behaviors to avoid filtering, the behavior features library would be expanding and growing. Aiming at this paradox, to meet the precondition of computing efficiency and trusted features extraction, we use Mutual Information (MI) features extracting algorithm (Pelletier et al., 2004). MI can well represent the statistic essentiality of a behavior and therefore can be utilized as a trusted approach for feature selection. To obtain a better discrimination boundary, MI algorithm can adapt to select the most important behaviors and discard less important behaviors dynamically.
Table 1: | Category list for spammers behavior features |
Table 2: | The mapping relation between behaviours and messages fields |
MI behaviors feature selection algorithm is described as follows:
• | Establish a message samples set. |
• | For each message m to create binary behavior feature vector Vi = {v1, v2..., vN,} where setting vj as 1 represents that m contains the behavior feature fj and otherwise setting vj as 0. |
• | For each fj to compute the mutual information value I. |
Where, stochastic variable X and Y are defined as follows.
• | According to all of the Ifj (X, Y) values sorted by descend order, pick out the first n behavior features to be mapped as a n-dimension feature vector. |
By utilizing this algorithm, compared with content-based feature selection, the volume of the selected behaviors features would be much smaller. The smaller volume of features is able to improve the efficiency of training and classification.
SVM CLASSIFICATION MODEL
Based on SVM theory, email classification objective is about to solve a two-class quadratic optimization question which can be described as f(x) = h (x)T β +β0. Subject to the bias and variance limitation, we need to achieve a fitted support vector discrimination boundary. The boundary should adapt to the classification models online updating. An applicable classification model should be well-balanced between bias and variance, for the two factors might lead to over-fitting or under-fitting problem. Figure 2 shows the relationship among bias, variance and discrimination boundary.
Solution of the convex quadratic optimization question is described as expression (1)
Fig. 2: | Bias-variance influence on SVM fitted boundary |
Subject to εi ≥
0, yi (h (xi)T β + β0
≥ 1 - εi ∀ i |
(1) |
Where:
εi | = | A relaxation factor, |
γ | = | An adjustable parameter. |
To ensure better testing error, we use normalized parameter γ. The normalized parameter can well balance bias and variance and ensure to achieve an optimum boundary.
By using basis function, we would obtain an expanded feature space. The classifier function can be formalized as follows:
(2) |
In related SVM literatures (Duda et al., 2003), there are three kinds of popular kernel functions available: namely multinomial function, radial basis function and sigmoid function. All of them satisfy Mercer condition.
(3) |
(4) |
(5) |
We are going to make related performance tests by applying the three kernel functions into the classifier function, respectively. The testing results in next section suggest that the classifier who uses radial basis function has the best performance. Here, we use two sets of random 2D samples to testify the SVM classifiers performance preliminarily. After the considerate comparison and regulation, we choose normalized parameter γ = 0.002 and γ = 0.006 to have the test. From the Fig. 3, we found that the classification boundary is very smooth and not oscillating for the possible over-fitting.
Fig. 3: | 2D SVM discriminant boundary generated by RBF kernel |
Table 3: | Basic statistics for sample sets |
According to the testing result, the ratio of wrong classified points is 1.25% in Fig. 3a and 1.37% in Fig. 3b. The results basically prove that the discrimination boundary is precise and stable on 2D testing samples. As for standard email testing, we use a email testing set collected from an authentic mail server mci.uestc.edu.cn. The scale of the email set is shown as Table 3.
We use training set as initial samples set. After accomplishing the extraction of features matrix under the condition of pre-treatment, we put it into the spam classifier for training and then obtain a hyper-plane which supports vector in high dimensional space. The classifier maps the hyper-plane back to primitive space and forms an authentic discrimination boundary. In practice, the discrimination boundary trained by trusted behaviours vector is correspondingly stable but not static. Alone with the new emails arrived, the classifiers boundary should be self adapted. For satisfying this applicable requirement, we propose a boundary self-adjusting algorithm.
• | Extracting new arrived emails behaviour features. Constructing new input vector and appending it into the initial vector set S. For each vector, let α1 = 0; |
• | Selecting b vector samples randomly from S to form a subset; |
• | Using G (β) to train vector set; |
• | Applying the model on all of the vector samples of S; |
If there exist samples which are not satisfied Karush-Kuhn-Tucker condition, then using these samples to replace all of the samples in β and corresponded αi; If its not converged, then go back to step 3.
Table 4: | Testing error comparisons under disturbing features input |
TESTING RESULTS AND ANALYSIS
The spam classifiers generalization performance is always involved with its predicting capacity for independent email testing samples. Testing error which is also called generation error is expected to predict error on independent testing samples (Zhang et al., 2004).
(6) |
Email filters based on machine learning are often concerned with its robust performance. For content related email filter, if disturbed words or sentence inserted, its classification performance would change evidently. But it would be no influence on behavior-based filter we proposed. So, we try to introduce some legitimate behaviors as disturbing factors to test the filters robust performance. Table 4 shows that all SVM based filters outperform other filters. Among those SVM based filters, SVM/RBF kernel filter is the best one, for it is most stable under introduced disturbing behaviors. On the other hand, Naïve Bayes filter has the poorest performance under disturbing condition.
Aiming at two-class classification problem like spam category, using cross-validation approach (OBrien and Carl Vogel, 2003) (data are shuffled and different parts are used for training and test in each iteration) could evaluate objectively how precise the spam filter is. Spams misclassification problem can be divided into two types,
Fig. 4: | Comparison for RFP and RFN |
namely false positive and false negative. The former means the filter recognizes the ham as spam. The latter, on the contrary, means the filter recognizes the spam as ham. Here, we give the FP ratio and FN ratios definitions (Androutsopoulos et al., 2000) as follows:
(7) |
(8) |
Where, NL and NS denote the volume of ham and spam, respectively. nL→s and ns→L denote false positive emails and false negative emails separately. From Fig. 4, alone with the input testing samples increased, all of the filters FP ratio and FN ratio is decreasing. SVM/RBF kernel filters FP ratio and FN ratio is decreasing much faster than others which suggests it has comparative less misclassifications than other filters. BP ANN filters FN ratio is less than FP ratio which suggests its performance is not well-balanced. Naïve Bayes filters performance is stable and balanced both on FP ratio and FN ratio. But its precision is not as good as SVM/RBF kernel filter.
CONCLUSIONS
The main contributions we made in this paper can be concluded as follows:
• | We investigated on spammers behavior profoundly and utilized MI algorithm to extract trusted behavior features vector based on a behavior features library. |
• | We brought forward SVM-based spam filter with normalized parameters which ensure well balanced bias and variance. According to the comparison tests among several popular filters under authentic email testing set, we can draw the conclusion that new email filter has high precision in email category and is robust to disturbing data. |
ACKNOWLEDGMENTS
We thank professor Tan Liang and doctor candidate Wang TieJun at Westone United Lab for their previous work on Chinese email testing sets collecting and organization. This work is supported by China national 863 subject foundation (Grant No. 2006AA01Z411).