For the past several years, the volume of e-mail traffic has indicated a considerable
increase in the number of phishing e-mails (Subramanian
and Ramaraj, 2007). Phishing e-mails depend on social engineering in order
to try and steal the financial account credentials or passwords of users. Phishing
e-mails are designed to appear from legitimate businesses and agencies and have
an embedded link which will redirect users to a fake Web site. This will then
trick them into providing vital information by requiring them to fill out a
form purportedly intended to update their account or billing information; the
phisher then uses the data obtained to steal from online customers (Zhen
et al., 2008; APWG, 2010).
Gartner (2007) found that approximately 3.6 million
computer users in the United States have experienced phishing attacks, with
the total losses amounting to approximately US$ 3.2 billion. In fact, the number
of user victims increased from 2.3 million in 2006 to 3.6 million in 2007, representing
an increase rate of 56.5%.
One of the most serious problems brought about by phishing is to harm electronic
commerce as it causes users to lose their trust in electronic business (Folorunso
et al., 2006; Lin, 2010). There is a wide
range of phishing e-mails ranging from the very simple to the very complex messages.
Moreover, phishing attacks are capable of deceiving even the cleverest users,
making it difficult to have a fixed rule in solving the problem. Therefore,
a system that evolves and can interact with any new type of phishing e-mail
is required. Toward this end, a number of studies have attempted various approaches
to address the issue of phishing.
The present study has proposed a novel idea related to a model based on the
evolving connectionist system (Kasabov, 2003). PECM model
is called the Phishing Evolving Clustering Method (PECM) which is designed to
work in online mode and has a high speed of classification method and one-pass
memory; it was adapted from the Evolving Clustering Method for Classification
(ECMC) (Song and Kasabov, 2003). In the current study,
PECM was shown to have a high level of accuracy with low False Positive (FP)
and False Negative (FN) rates compared with other approaches. Where FP denotes
non-phishing e-mails marked as phishing where as FN represents the misidentification
of a phishing e-mail.
Although, a large number of studies have focused broadly on phishing or Web phishing, comparatively few have examined phishing e-mail detection, particularly phishing e-mail detection and classification in an online model. As phishing e-mails represent the main gateway of phishing Web sites, by reviewed a set of papers discussing the various phishing e-mail methodologies used.
One of the main approaches in phishing e-mail detection and classification
is the machine learning technique that depends on supervised or unsupervised
learning techniques (Pugazhenthi and Rajagopalan, 2007;
Yang et al., 2009). The machine learning technique
depends on classifiers that attempt to create a map between the inputs to the
desired output depending on a specific function. The main rule of the classifier
is based on learning of several inputs or features to predict a desired output
(Christy and Thambidurai, 2006). Abu-Nimeh
et al. (2007) compared six classifiers related with the machine learning
technique for phishing detection and used 43 features to train and test the
six classifiers. The results indicated that the there are no standard classifiers
for phishing e-mail detection; for example, some classifiers have low FP levels
but have high FN levels such as the Logistic Regression classifier, which has
good FP results but has a high FN score.
Saberi et al. (2007) proposed a mechanism based
on three learning algorithms to detect phishing e-mails which depends on the
binary classification of a scam or non-scam e-mail. He built combinations between
three algorithms, namely, the K nearest neighbor, the Poisson theory and the
Bayesian theory and then used the merged results to propose a new mechanism
capable of enhancing accuracy. His proposed method had an accuracy of up to
94.4%, with the FP reaching up to 0.08%. Bergholz et al.
(2008) proposed arithmetic filtering capable of detecting new phishing messages
with different contents depending on the trained characteristic features of
the e-mails and author-trained e-mail features, which was conducted using Dynamic
Markov Chains algorithm. These approaches can reduce the memory consumed by
the system while reducing the number of FPs and FNs.
The approach by Islam et al. (2009) was related
to machine learning and depends on a built system consisting of a three-tier
classification to detect phishing e-mails. The methodology involved in this
technique depends on feature extraction and classification in sequential style,
after which the output is sent to the decision process. Therefore, if the e-mail
is misclassified by any tier (classifier), the last decision will be from the
final tier. This technique proved that sequential order classifiers including
Support Vector Machines (SVM), boosting (ada boost) and Bayesian (naïve
bayes) have the highest levels of accuracy reaching up to 97%.
Yearwood et al. (2010) suggested a new technique
based on the profiling of phishing e-mail and emphasized embedded hyperlink
information only by extracting 12 features signifying phishing e-mail; the data
were then divided into two classes. The extraction features are based on a binary
value which is 1 if the e-mail has these features and 0 otherwise. Classifier
algorithms such as SVM and adaboost are then used to classify e-mails.
Only a few approaches have used clustering for the classification of phishing
e-mails. Dazeley et al. (2010) proposed a method
that depended on the combined method between unsupervised and supervised classification
algorithms. In this method, independent clustering was first used to randomize
input data and then Consensus Clustering was developed by combining it with
independent clustering. Next, Consensus Clustering was used for training and
classifying the entire data set. This technique increased the speed of classification
and had better accuracy compared with the k-means algorithm (Dehuri
et al., 2006; Ranjan and Khalil, 2007). In
this study, which proposed a new model able to distinguish between phishing
email and ham email in online mode with high accuracy, it can learn continuously
without consuming too much memory because it works on a one-pass algorithm.
It has the capability to use a shorter period of e-mail data capturing to change
its profile if the characteristics of phishing e-mail features have changed.
PECM proposed model has order steps as shown in Fig. 1. These order steps explain how the classification of e-mails in online mode into two classes, phishing e-mails and ham e-mails, is done. The model consists of three stages, namely, pre-processing, e-mail object similarity and application of the clustering technique Phishing Evolving Clustering Method (PECM). All of these stages work after the features of phishing e-mail utilized in our model have been determined.
Phishing e-mail features used in PECM: The PECM collects e-mails separately
and then filters them consecutively. The method depends on 16 features representing
he ost effective features of phishing e-mail. All of these features have been
proposed in earlier studies (Drake et al., 2004;
Fette et al., 2007), with some enhancements
on the feature extraction technique.
|| The overall phishing Email clustering approach-PECM
|| Phishing E-mail features used in PECM
The 16 features were implemented as a binary value (0 or 1), with a value of
1 flagging this feature as a phishing e-mail and 0 otherwise. The 16 features
are explained in Table 1.
Pre-processing: Pre-processing has two stages. The first stage is parsing
and stemming of e-mails. Parsing is used to extract the features of phishing
e-mails, whereas stemming is used to clean the text data integrated with the
features of a phishing e-mail. The second stage involves importing features
that represent phishing e-mails only, then translating the data to binary values
(1, 0), with a value of 1 indicating a phishing e-mail feature and 0 otherwise.
The processed e-mail data set to E-mail Object Similarity is then taken. In
this phase, the extraction features based on a series of short code scripts
are used to extract and analyze phishing e-mail features as explained in the
E-mail object similarity: This phase involves three processes as explained below.
Feature ranking and classification: The process of ranking features
is used to determine the most effective feature in the system. Our model used
Information Gain Ratio method (IGR) algorithms which work based on the extraction
of similarities between sets of e-mails and then give the highest weight to
the most effective features based on the class of phishing and ham e-mails belonging
to IGR (Mori, 2002), as explained in the following equations.
where, gain_r (X,C) represents the gain ratio of the feature X frequency in class C.
where, Ci and |Ci| denote the frequency of features X in class C, the i-th sub-class of C and the number of features in Ci, respectively.
Table 2 provides a complete ranking of all features selected for the proposed model. The more the information gain is, the more helpful a feature will be. By examination, html e-mail is the feature with the best quality, whereas diffsenlindom is the least helpful and possibly causes noise in the classifier.
Creation of crisp value: The creation of crisp value is used to convert the binary values (0, 1) of all e-mail data sets into crisp values by dividing all features on a 100 score based on this algorithm:
where, x is Crisp value, i is the feature number and GR is the information gain ratio. Then for each feature in the data set, each binary value is multiplied and the product is the crisp value, as shown in Table 3.
Grouping features e-mail similarity: This phase is intended to make
the data set simpler and faster in the classification processes.
|| Phishing e-mail features with IGR
|| Phishing e-mail groups features with crisp values
In our method, there are two groups of phishing e-mail features, with each
group consisting of summation eight features. The first group is the body features
and the second is the URL features. The 16 features are classified into two
and are stored on a database to be used in the PECM (Table 3).
Applying the clustering technique PECM: The PECM algorithm adapts ECMC
(Kasabov and Song, 2002; Song and
Kasabov, 2003) and classifies e-mails into two classes, ham e-mails or phishing
e-mails, in the n-dimensional input space based on evolving rule nodes. The
rule created from each node is connected with a class by a constant. These rule
nodes are sometimes related to the same class. There are two stages of PECM.
The first is the learning stage which involves the application of the ECM algorithm
on the data pairs (x, y), where x is the input vector value (features group
values in our model) and y is the output of input vectors. The input vectors
are dealt sequentially with a known class label in the learning stage. The order
steps of the learning phase are as follows:
||IF the system is learning all input vectors, then finish the
learning phase; ELSE enter a new input vector from the data source
||For any class, find all existing rule nodes related with this class as
the class of the input vector.
||IF this is the first time of the system, Then create a new rule but the
position of a new rule should be the same as the current input vector in
input space and the radius = Min-radius parameter; otherwise, Go to Step
||FOR each rule created before, IF the input vector does not belong within
the related field, increase this field if it has potential based on the
field having a radius of R0; the shortest distance between the
rule node and the input vector is d. The increased radius is Rnew
= (R0 + d) / 2
||IF the increase on the field which depends on whether a new field does
not contain any input vectors from the data set, is successful, Then the
rule node changes its location and the field increases; otherwise, the system
will not change both its rule node and field
||IF the input vector belongs to a related field, go back to step 1; ELSE,
a new node is created. Go back to Step 1
||End of the learning process. The procedure of learning takes one iteration
only but all input vectors may change their position many times in input
The second stage is the classification of new input vectors and involves two steps.
A new input vector (phishing e-mail features values) is entered, then the distance between input vectors with all rule nodes is calculated. IF the input vector is contained by a field of one or more rule nodes related with one class then the input vector will belong to this class. IF the input vector does not belong to any field, then the input vector will go to the nearest rule node.
IMPLEMENTATION AND TEST RESULTS
Data set: The data set used for the assessment of the proposed model
was taken from two sources. A sample set of 4,550 phishing messages received
from November 2004 to August 2007 was provided by the monkey Web site (http://monkey.org/%7Ejose/wiki/doku.php?id=PhishingCorpus).
The second source is the ham corpora from the SpamAssassin project (http://spamassassin.apache.org/publiccorpus/),
which consists of both easy and difficult messages comprising 4150 ham e-mails.
The ham corpus was collected from 2002 and 2003 and the most recent update was
in January 2006.
In the proposed method, the newest 2000 e-mails were selected from each ham and phishing e-mail class. Through the extraction of hyperlinks, e-mails that did not include any hyperlink information were removed. The final data set of the two classes comprised 4000 e-mails.
Experimental analysis: We used MATLAB version 7.10, in PECM for the computation and analysis. In our test, we m ixed the two classes, with one class consisting of 2,000 phishing e-mails and the second class consisting of 2,000 ham e-mails. The mix was ordered serially and the data were divided into two parts: the first with 3,000 samples input for the learning data set and the second with 1,000 samples input for the testing data set. The parameters for PECM were maxfield = 0.1, minfield = 0.01 and one epoch, where maxfield is the maximum radius of the cluster and minfield is the initial radius when a new cluster is created (new rule). The accuracy of the learning phase was 100%, whereas the accuracy of the testing phase was 99.7%. The model used rule extraction in the learning phase to obtain the results in the testing phase. Ours has a few rules reaching up to 24 rules and is capable of classifying a large numbers of e-mails at a high speed without consuming much time in online mode. Some of the rules are shown below.
An example of rules was extracted by PECM:
||IF Centre is [0.53, 0.24] and Radius is 0.10 Then Class is
 223 Samples in Cluster
||IF Centre is [0.70, 0.16] and Radius is 0.10 Then Class is  45 Samples
||IF Centre is [0.05, 0.08] and Radius is 0.05 Then Class is  1097 Samples
||IF Centre is [0.51, 0.05] and Radius is 0.03 Then Class is  403 Samples
||A comparison between our models with other Approaches for
phishing email detection
||PECM (on-line, one-pass) model-Learning samples
||PECM (on-line, one-pass) model-testing samples Testing samples
We also compared the results of our model with those of other approaches to
prove that our model has the best results in terms of FPs, FNs and accuracy.
Notably, FP denotes non-phishing e-mails marked as phishing, whereas Fns represent
missing a phishing e-mail as shown in Table 4. Figure
2-5 show some of the figures denoting the accuracy of
the training and testing phases in PECM in 2d space.
In Fig. 2, PECM shows the level of accuracy between the actual and desired results in training 3000 samples of data has two values (phishing email or ham email) in On-line mode by 2D space.
In Fig. 3, PECM shows the level of accuracy between the actual and desired results in testing 1000 samples of (phishing email and ham email) in Online mode by 2d space.
In Fig. 5, PECM shows the level of general accuracy, by two columns represent two class (number 1 represent phishing email and number 2 as ham email ) the level of accuracy from 0% up to 100%.
||The groups of phishing email and ham email by adaptive PECM
||PECM class accuracy where class 1= phishing email samples
and class 2 = ham email samples
CONCLUSIONS AND FUTURE WORK
PECM is a clustering-based learning algorithm that uses clusters of different shapes to distinguish between phishing and ham e-mails in online mode. We used a new technique for extracting features depending on the suggestion that all features have a binary value of either 0 or 1. The approach uses a new incremental clustering algorithm adapted for this purpose and depends on the (MaxDist) between the input vector and the cluster center for classification and building a new rule. The experiments proved that our model has good performance and high accuracy compared with other learning algorithms. PECM works in online mode, making it potentially useful in real time. For future works we suggest the use of online and offline clustering methods to build a system that can work in real time in order to ensure higher accuracy and better performance.
This research is supported by National Advanced IPv6 Centre of Excellence (NAV6) Universiti Sains Malaysia (USM).