Research Article
 

Anomaly Detection in Transactional Sequential Data



Jingwei Zhang, Yuming Lin, Huibing Zhang and Qing Yang
 
ABSTRACT

A transactional system typically exhibits a steady distribution over its decisions or responses to various transactions. If the system changes, that distribution often changes as well, so detecting anomalies caused by system changes from distribution differences is valuable. In this study, we model these decision or response signals as a series of time-related distributions and propose a method that combines distance metrics with anomaly detection to discover whether a system has changed. Distance metrics on different distributions decide whether a change has happened and anomaly detection then reveals what happened. Extensive experiments show that our method performs well and locates anomalies accurately.


 
  How to cite this article:

Jingwei Zhang, Yuming Lin, Huibing Zhang and Qing Yang, 2012. Anomaly Detection in Transactional Sequential Data. Information Technology Journal, 11: 782-787.

DOI: 10.3923/itj.2012.782.787

URL: https://scialert.net/abstract/?doi=itj.2012.782.787
 
Received: November 04, 2011; Accepted: February 14, 2012; Published: March 26, 2012



INTRODUCTION

For online transactions, whether a transaction succeeds is influenced by many factors. For example, when you dial with your cell phone, whether the transaction can continue depends on whether you entered a correct phone number, whether your account has enough credit, whether the network is good, whether the receiver is in a region with signal coverage and so on. Although a failed transaction may be caused by the system itself or by other factors, a stable transactional system should show a steady distribution over its responses from a statistical perspective. In practice, online transaction systems are continually improved to provide richer and more reliable services for their customers; it is also possible that mistakes are introduced into those systems by careless actions. Such inadvertent mistakes are very difficult to discover since they occur only under specific conditions and are not visible to all customers. For example, you may often receive a message such as "The number you dialed cannot be connected" when using your cell phone; obviously, this message offers no guidance on what to do next. Although such uncertain messages are inevitable, they should follow a steady distribution over a large number of historical records. If an operator's probability of producing these uncertain messages suddenly rises, the operator may have changed its system.

It is an important problem for third parties to monitor whether changes have been made in the systems of any party involved in transactions. Anomaly detection on distributions is an effective monitoring mechanism. In this study, we treat those response signals as indicators; all indicators should have a steady distribution over a large volume of transactions if the corresponding systems are unchanged. A successful change should let a system provide better services while still presenting a steady distribution, whereas a bad change will cause the system to show unsteady indicator distributions. The problem of detecting system changes and system stability is thus transformed into computation over different indicator distributions.

Classification and clustering are two primary technologies for data analysis. Hannan et al. (2008) applied a classification method to tire pressure and temperature data for intelligent vehicle performance evaluation. Hussain et al. (2008) provided a method to detect drowsiness by combining electrooculogram and head-nodding signals, which can be analyzed and represented as feature vectors of two states, alertness and sleepiness. Arora et al. (2009), Ranjan and Khalil (2007) and Premalatha and Natarajan (2010) applied clustering methods to different data. Anomaly detection is similar to outlier detection and skyline queries in some applications, such as network intrusion detection (Thottan and Ji, 2003; Teodoro et al., 2009), privacy preserving (Vaidya and Clifton, 2004), software bugs (Hangal and Lam, 2002) and so on. Elahi et al. (2009) researched how to detect outstanding outliers in data streams with a grid-based technique. Abe et al. (2006) proposed a selective sampling mechanism based on active learning for outlier detection. Stibor et al. (2005) and Gonzalez et al. (2002) used classification methods for anomaly detection, focusing on the negative selection algorithm. Song et al. (2007) made use of users' domain knowledge to discover the anomalies in which users are really interested, a goal shared with our study. Chan and Mahoney (2005) provided algorithms for anomaly detection in monitoring applications in which the time factor is considered. More information on anomaly detection is covered by Chandola et al. (2009). Uysal (2007) presented applications and a comparison of two models for predicting time-series values, radial basis function networks and autoregressive integrated moving average models. Sharma et al. (2007) researched interestingness measures to extract interesting results from a large number of classification results.

Our contribution is to model system stability as indicator distributions and to use distance metrics and anomaly detection on different distributions to discover system changes. Based on a large number of indicators, our method can judge whether changes have happened in the corresponding systems.

PROBLEM SETTING

From a statistical perspective, the behaviors of a system can be represented as a series of distributions over its indicators. Given an indicator set I = {i1, i2, ..., in} and two distributions on I, D1 = {<i1, p11>, <i2, p12>, ..., <in, p1n>} and D2 = {<i1, p21>, <i2, p22>, ..., <in, p2n>}, where:

0≤pkj≤1 and pk1+pk2+...+pkn = 1, for k = 1, 2

D1 and D2 correspond to different system states and reflect different system characteristics. Generally, for a given indicator set, only one indicator, such as i1, represents success and the others represent failures. Given a time set T = {t1, t2, ..., tm}, if all distributions are time-related, we have TD = {<t1, D1>, <t2, D2>, ..., <tm, Dm>}. The aim of anomaly detection is to find abnormal distributions and to locate the abnormal indicators in a series of given distributions representing the stability of a system, which can be considered strong evidence of bad system changes.

Problem definition: Given a set of time-related distributions TDS = {TD1, TD2, ..., TDm} over a time series T = {t1, t2, ..., tm}, anomaly detection is to find whether there is a distribution change on TDS. A distribution change on T means that there are two distribution sets DS1 and DS2 with related time sets T1 and T2 such that T = T1∪T2, T1∩T2 = ø and either MAX (T1) < MIN (T2) or MAX (T2) < MIN (T1).
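The change condition above can be written as a small predicate; the following is a minimal sketch in which the function name and structure are ours, not from the paper:

```python
# Sketch of the distribution-change condition from the problem definition:
# T1 and T2 must be disjoint and one must lie entirely before the other.
# Names here are illustrative only.

def is_distribution_change(t1, t2):
    """Return True if time sets t1 and t2 witness a distribution change."""
    if not t1 or not t2:
        return False
    if set(t1) & set(t2):  # T1 and T2 must be disjoint
        return False
    # MAX(T1) < MIN(T2) or MAX(T2) < MIN(T1)
    return max(t1) < min(t2) or max(t2) < min(t1)
```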

A distribution change can also be understood as a transfer from one steady state to another. If the second steady state results from a bad system change, it is valuable to capture that state, since the system should be corrected immediately. Anomaly detection thus includes two stages: the first defines steady distributions over TD1, TD2, ..., TDm and the second discovers detailed anomalies by comparing the steady distribution with the current distribution TDm+1. In the remainder of this study, we focus on the failure indicators, since the successful indicator and the set of failure indicators are complementary.

Example 1: Suppose a system has 17 indicators, expressed as A0, A1, A2, ..., A16, where A0 is the successful indicator and A1, A2, ..., A16 are failure indicators. Fig. 1a shows 5 different indicator distributions whose related times satisfy t1<t2<t3<t4<t5; indicators A1, A4 and A7 are rising, which suggests that the system may have been changed. Fig. 1b presents the comparison between the baseline, established from historical distributions, and the current distribution TD5; A1, A4 and A7 are detected as anomalies since their probabilities exceed the baseline.

BASIC AND IMPROVED METHOD

Distance metrics are used to judge whether the current indicator distribution is inconsistent with the baseline distribution and anomaly detection is used to discover which indicator is bad.

There are two situations for anomaly detection: in the first, the successful indicator changes greatly; in the second, the successful indicator shows no apparent change. Usually only the first situation, a sharp decline in the good indicator, receives attention. In this study, we also introduce distance metrics to detect the second situation.

Distance metrics: Distance metrics are often used to cluster similar points in a multi-dimensional space. Classical distance metrics include the Dice, Jaccard, Cosine and Euclidean distances, among others.

Fig. 1(a-b): An illustration of indicator distribution and anomaly detection, (a): Time-related indicator distributions, (b): A comparison of indicator distributions

These distance metrics have a wide range of applications, such as clustering analysis (Schenker et al., 2003), natural language processing (Lee et al., 2005; Dumais et al., 1998), text retrieval (Kim and Choi, 1999), image recognition (Salleh et al., 2011) and so on. Here, we treat a distribution over a set of indicators as a point in multi-dimensional space; because all dimensions lie in the same range (each is a probability between 0 and 1), Euclidean distance is well suited to measuring their similarity. Given a set of such points P = {p1, p2, ..., pn}, where each pi is a distribution over a set of indicators, assume all points in P are normal, i.e., gathered while the system was steady and anomaly-free; if a point p is far from all points in P, we can say that p is anomalous.
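The distance computation can be sketched as follows; this is a minimal illustration under our own naming, and the distance threshold is a free parameter, not a value from the paper:

```python
import math

# A distribution over n indicators is a point in [0, 1]^n; Euclidean
# distance then measures similarity between two such points.

def euclidean(d1, d2):
    """Euclidean distance between two equal-length indicator distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def is_anomalous(p, normal_points, threshold):
    """p is anomalous if it is far from every known-normal distribution."""
    return all(euclidean(p, q) > threshold for q in normal_points)
```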

In fact, it is difficult to collect enough known-good distributions for anomaly detection, especially for third parties, since they cannot know whether a change has been made to a system. The usual situation is that we have a group of distributions and need to distinguish the abnormal ones from the normal ones. To find which points are outliers, we use Euclidean distance to compute the distance between points and then group points with small mutual distances into sets; every such set should belong to one steady distribution. Time information is then checked to judge whether the system has changed. Suppose distance computation yields two sets, S1 = {<t11, p11>, <t12, p12>, ..., <t1n, p1n>} and S2 = {<t21, p21>, <t22, p22>, ..., <t2m, p2m>}, with related time sets T1 = {t11, t12, ..., t1n} and T2 = {t21, t22, ..., t2m}. If MAX (T1) < MIN (T2) or MAX (T2) < MIN (T1), we can say the corresponding system has changed. In the following, we only consider MAX (T1) < MIN (T2). If all failure indicators in S2 have lower probability than in S1, the change was successful; our aim is to find failed changes, in which some indicators in S2 have higher probability than in S1.

The partition is done by distance computation. We first pick the point pair with the maximum distance; these two points become the first elements of the two categories. Each remaining point is then assigned by its average distance to the current categories: a point is allocated to the set with the minimum average distance. The average distance between a point p and a set S is defined as:

avgdist (p, S) = (1/|S|) Σq∈S dist (p, q)

The detailed process for distribution partition is presented in algorithm 1.
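Since the Algorithm 1 listing survives only as an image in this copy, the partition step just described can be sketched as follows; this is a minimal sketch and all names are ours:

```python
import math
from itertools import combinations

# Sketch of the partition step: seed two groups with the farthest pair of
# distributions, then assign every remaining point to the group with the
# smaller average distance.

def euclidean(d1, d2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def avg_distance(p, group):
    return sum(euclidean(p, q) for q in group) / len(group)

def partition(points):
    """Split distributions into two groups around the farthest pair."""
    i, j = max(combinations(range(len(points)), 2),
               key=lambda ij: euclidean(points[ij[0]], points[ij[1]]))
    g1, g2 = [points[i]], [points[j]]
    for k, p in enumerate(points):
        if k in (i, j):
            continue
        (g1 if avg_distance(p, g1) <= avg_distance(p, g2) else g2).append(p)
    return g1, g2
```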

Example 2: Table 1 illustrates five distributions over an indicator set {A0, A1, A2, A3}, namely {<t1, D1>, <t2, D2>, <t3, D3>, <t4, D4>, <t5, D5>}, where t1 to t5 is an increasing time sequence.

According to algorithm 1, the five distributions are divided into S1 = {<t1, D1>, <t2, D2>, <t3, D3>} and S2 = {<t4, D4>, <t5, D5>}. If A0 represents the successful indicator, the corresponding system has changed, since S2 presents a different distribution from S1.

Algorithm 1: Abnormal distribution detection

Table 1: Distributions on an indicator set

Anomaly detection: With distance metrics alone, we can only identify whether there are changes in a system; we do not know the change details, namely which indicator has shifted across a series of distributions. In this section, we discover the change details through anomaly detection. In fact, distance metrics measure changes from a multi-dimensional perspective, while anomaly detection finds change details from a concrete one-dimensional perspective.

We use distribution trends to detect anomalies before and after a system change. Whether an indicator is abnormal depends on whether it has different distribution trends before and after the change. Here, distribution trends are measured by a baseline and a fluctuation range: the baseline is represented by the average and the fluctuation range by the variance. Given two value series of the same indicator, corresponding to the two phases before and after the system change, we compute their averages, base1 and base2, and variances, var1 and var2. If the average shows an obvious increase or the variance shifts noticeably, the indicator is abnormal. An abnormal failure indicator takes one of two forms:

base2>base1+β, namely the baseline has increased beyond the expected range after the system change
|var2-var1|>ε, namely the indicator presents an unsteady state

For the successful indicator, the first rule becomes base2<base1-β (a sharp baseline decline) and the second rule stays the same. Here, β and ε are thresholds set experimentally. The detailed process is presented in algorithm 2.
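The two rules can be sketched as follows; the exact form of the variance rule and of the inverted baseline rule for the successful indicator are our reading of the text, and all names are ours:

```python
# Sketch of the baseline/variance rules. An indicator is flagged when its
# baseline (mean) rises by more than beta, or its fluctuation (variance)
# shifts by more than eps; for the successful indicator the baseline rule
# is inverted (a sharp decline). Rule forms are our assumption.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def is_abnormal(before, after, beta, eps, successful=False):
    """before/after: probability series of one indicator in the two phases."""
    base1, base2 = mean(before), mean(after)
    if successful:
        baseline_shift = base2 < base1 - beta   # success rate declined
    else:
        baseline_shift = base2 > base1 + beta   # failure rate increased
    unsteady = abs(variance(after) - variance(before)) > eps
    return baseline_shift or unsteady
```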

Algorithm 2: Anomaly detection

Improvement on distance metrics and complexity: Though a steady system usually presents a steady distribution over its indicators, exceptions are inevitable: a distribution may look abnormal even though no system change occurred. Strict distance metrics cannot handle this accurately.

It is therefore possible that a distribution from before the change is assigned to the distribution set from after the change, and vice versa. We introduce a probability threshold to address this. Suppose time sets T1 and T2 are collected from two time-related distribution sets S1 and S2, where S1 corresponds to the distribution before the system change and S2 to the distribution after the change; if:

|{t∈T2: t<MAX (T1)}| / |T2| ≤ λ

we still say that the corresponding system has changed. Here, λ is set empirically.
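One reading of this relaxed test, sketched under our assumption that at most a λ fraction of T2's timestamps may precede MAX (T1):

```python
# Sketch of the relaxed change test (our reading of the formula above):
# instead of requiring MAX(T1) < MIN(T2) strictly, tolerate at most a
# lam fraction of T2 timestamps that precede MAX(T1).

def changed(t1, t2, lam):
    """True if T2 essentially follows T1, up to a lam fraction of overlap."""
    misplaced = sum(1 for t in t2 if t < max(t1))
    return misplaced / len(t2) <= lam
```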

Another improvement concerns computation complexity. Strict distance metrics need a distance computation between every pair of distributions, with time complexity O (n2). Since the maximum distance is most likely found between the two distributions at the start and end of a time series, we can pick these two points to establish the initial categories. In addition, the pairwise distances can be pre-computed and stored in a hash structure, which improves partition speed.
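The two speedups can be sketched as follows; the memoization via a cache and the naming are ours:

```python
import math
from functools import lru_cache

# Sketch of the two speedups: seed the categories with the first and last
# distributions in time order instead of scanning all pairs, and cache
# pairwise distances (a hash-backed memo) for reuse during partition.

def make_distance(points):
    """Return a memoized pairwise-distance function over index pairs."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(points[i], points[j])))
    return dist

def initial_seeds(points_in_time_order):
    """Initial category seeds: the earliest and latest distributions."""
    return 0, len(points_in_time_order) - 1
```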

EXPERIMENTS

We carried out experiments to verify the efficiency and effectiveness of the proposed method.

Fig. 2: Accurateness under steady and unsteady distributions

The experiments use two datasets: one is transformed from real transactional data and the other is machine-generated based on the former. All experiments ran on a Core Duo 2.2 GHz CPU with 2 GB of memory.

Accurateness: To verify the effectiveness of our method, we carried out a group of experiments on the first dataset, which is transformed from a real one. This dataset involves 6 institutions and 30 days of transactions. The number of transactions per day varies from tens of thousands to several millions and the number of indicators ranges from 30 to 40. Here, we assume that every institution corresponds to one system, since we cannot know how many systems really exist in an institution. Every experiment has two phases: the first finds whether any institution has changed and the second finds the change details. We compare accurateness in two situations: one tests whether the proposed method reports anomalies under a steady distribution and the other tests whether it can locate anomalies accurately in distributions after system changes. The experimental results are presented in Fig. 2. When all distributions are steady, namely there is no system change, the proposed method is perfect, giving no false positives. When checking for anomalies under system changes, the proposed method maintains an accurateness above 90%.

Time performance: In this group of experiments, we expanded the dataset to verify the efficiency of our method. We expanded the number of institutions to 5000, 10000, 15000 and 20000, each of which still has 30 days of transactions.

Fig. 3: Computation time comparison between original and improved program

The number of indicators still varies from 30 to 40. The original program already performs well, but the improved version performs better. The experimental results are shown in Fig. 3. The execution time of the original program rises faster than that of the improved one as the number of institutions grows. More time may be consumed when the number of transaction days increases greatly. In this dataset, we only expanded the number of institutions based on their distributions.

CONCLUSIONS

In this study, we model a steady system as a steady indicator distribution, so a system change can be detected through a distribution offset. All information returned by a system is modeled into an indicator distribution; by monitoring this distribution, a system change can be discovered effectively. We use a two-stage method to locate system changes: the first stage decides whether the system has changed and the second stage detects which indicator has an abnormal distribution. The proposed method can find system changes effectively, which is very significant for quality analysis of online transactions.

ACKNOWLEDGMENT

We gratefully acknowledge the support of the Education Department Foundation of Guangxi under grant No. 201010LX154.

REFERENCES

  1. Abe, N., B. Zadrozny and J. Langford, 2006. Outlier detection by active learning. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2006, ACM Press, USA., pp: 504-509


  2. Vaidya, J. and C. Clifton, 2004. Privacy-preserving outlier detection. Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 1-4, 2004, Brighton, UK., pp: 233-240


  3. Stibor, T., P. Mohr and J. Timmis, 2005. Is Negative Selection Appropriate for Anomaly Detection. ACM, New York, pp: 504-509


  4. Gonzalez, F., D. Dasgupta and R. Kozma, 2002. Combining negative selection and classification techniques for anomaly detection. Proceedings of the Congress on Evolutionary Computation (CEC'02), May 12-17, 2002, IEEE Computer Society, Honolulu, USA., pp: 705-710


  5. Hangal, S. and M.S. Lam, 2002. Tracking down software bugs using automatic anomaly detection. Proceedings of the 24th International Conference on Software Engineering, May 19-25, 2002, ACM, New York, pp: 291-301


  6. Song X., M. Wu, C. Jermaine and S. Ranka, 2007. Conditional anomaly detection. IEEE Trans. Knowl. Data Eng., 19: 631-645.


  7. Chan, P.K. and M.V. Mahoney, 2005. Modeling multiple time series for anomaly detection. Proceedings of the 5th IEEE International Conference on Data Mining, November 27-30, 2005, IEEE Computer Society, Washington, DC, USA., pp: 90-97


  8. Thottan, M. and C. Ji, 2003. Anomaly detection in IP networks. IEEE Trans. Signal Proc., 51: 2191-2204.


  9. Garcia-Teodoro, P., J. Diaz-Verdejo, G. Macia-Fernandez and E. Vazquez, 2009. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Secur., 28: 18-28.


  10. Lee, C., H. Park and C. Ock, 2005. Significant sentence extraction by euclidean distance based on singular value decomposition. Proceedings of the 2nd International Joint Conference of Natural Language Processing, October 11-13, 2005, Springer-Verlag, Berlin Germany, pp: 636-645


  11. Schenker, A., M. Last, H. Bunke and A. Kandel, 2003. Comparison of distance measures for graph-based clustering of documents. Proceedings of the 4th IAPR International Conference on Graph Based Representations in Pattern Recognition, June 30-July 02, 2003, Springer-Verlag, Berlin, Germany, pp: 202-213


  12. Dumais, S.T., J. Platt, D. Heckerman and M. Sahami, 1998. Inductive learning algorithms and representations for text categorization. Proceedings of the 7th ACM International Conference on Information and Knowledge Management, November 2-7, 1998, Bethesda, MD., pp: 148-155


  13. Salleh, S.S., N.A.A. Aziz, D. Mohamad and M. Omar, 2011. Combining mahalanobis and jaccard to improve shape similarity measurement in sketch recognition. Proceedings of the 13th International Conference on Computer Modelling and Simulation, March 30-April 1, 2011, IEEE Computer Society, Washington, DC, USA., pp: 319-324


  14. Kim, M.C. and K.S. Choi, 1999. A comparison of collocation-based similarity measures in query expansion. Inform. Process. Manage., 35: 19-30.


  15. Sharma, S., S. Khare and S. Sharma, 2007. Measuring the interestingness of classification rules. Asian J. Inform. Manage., 1: 43-49.


  16. Hannan, M.A., A. Hussain, A. Mohamed and S.A. Samad, 2008. TPMS data analysis for enhancing intelligent vehicle performance. J. Applied Sci., 8: 1926-1931.


  17. Hussain, A., B. Bais, S.A. Samad and S.F. Hendi, 2008. Novel data fusion approach for drowsiness detection. Inform. Technol. J., 7: 48-55.


  18. Elahi, M., L. Xinjie, M.W. Nisar and H. Wang, 2009. Distance based outlier for data streams using grid structure. Inform. Technol. J., 8: 128-137.


  19. Arora, A., S. Upadhyaya and R. Jain, 2009. Integrated approach of reduct and clustering for mining patterns from clusters. Inform. Technol. J., 8: 173-180.


  20. Premalatha, K. and A.M. Natarajan, 2010. A literature review on document clustering. Inform. Technol. J., 9: 993-1002.


  21. Uysal, M., 2007. Comparison of ARIMA and RBFN models to predict the bank transactions. Inform. Technol. J., 6: 475-477.


  22. Ranjan, J. and S. Khalil, 2007. Clustering methods for statistical analysis of genome databases. Inform. Technol. J., 6: 1217-1223.


  23. Chandola, V., A. Banerjee and V. Kumar, 2009. Anomaly detection: A survey. ACM Comput. Surv., Vol. 41, No. 3.

