Abstract: Data mining refers to extracting knowledge from large amounts of data. The discovery of interesting association relationships among huge amounts of data can help marketing, decision making and business management. Previous studies on mining association rules find rules at a single level. However, mining association rules at multiple levels may lead to the discovery of more specific and concrete knowledge from data. In this research we propose a new algorithm for mining multi-level association rules in large databases. It uses the counting inference approach, which allows performing as few support counts as possible. This new method requires fewer database scans at each concept level than existing algorithms for multi-level association rule mining from large databases.
Introduction
Association rule mining finds interesting associations among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. The discovery of interesting association relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalogue design, cross marketing and loss-leader analysis.
Studies on mining association rules have evolved from techniques for discovery of functional dependencies (Mannila and Raiha, 1987), strong rules (Piatetsky and Shapiro, 1991), classification rules (Han et al., 1993; Quinlan, 1992), causal rules (Michalski and Tecuci, 1994), clustering (Fishae, 1987), etc. to disk-based, efficient methods for mining association rules in large sets of transaction data (Agrawal et al., 1993; Agrawal and Srikant, 1994, 1995; Park et al., 1995). However, previous work has focused on mining association rules at a single concept level. There are applications which need to find associations at multiple concept levels. For example, besides finding that 80% of customers who purchase milk may also purchase bread, it could be informative to also show that 75% of people buy wheat bread if they buy 2% milk. The association relationship in the latter statement is expressed at a lower level but often carries more specific and concrete information than that in the former. This requires progressively deepening the knowledge mining process for finding refined knowledge from data. The necessity for mining multiple-level association rules or using taxonomy information in mining association rules has also been observed by other researchers (Agrawal and Srikant, 1994). To confine the association rules discovered to be strong ones, that is, the patterns which occur relatively frequently and the rules which demonstrate relatively strong implication relationships, the concepts of minimum support and minimum confidence have been introduced (Agrawal et al., 1993; Agrawal and Srikant, 1994). Informally, the support of a pattern A in a set of transactions S is the probability that a transaction in S contains pattern A and the confidence of A → B in S is the probability that pattern B occurs in S if pattern A occurs in S.
For mining multiple-level association rules, a concept taxonomy should be provided for generalizing primitive-level concepts to high-level ones. In many applications, the taxonomy information is either stored implicitly in the database, such as the fact that Wonder wheat bread is a wheat bread, which is in turn a bread, or computed elsewhere (Han et al., 1993). Thus, data items can be easily generalized to multiple concept levels. However, direct application of the existing association rule mining methods to mining multiple-level associations may lead to some undesirable results, as presented below.
First, large support is more likely to exist at high concept levels, such as milk and bread, rather than at low concept levels, such as a particular brand of milk and bread. Therefore, if one wants to find strong associations at relatively low concept levels, the minimum support threshold must be reduced substantially. However, this may lead to the generation of many uninteresting associations, such as toy → 2% milk, before the discovery of some interesting ones, such as Dairyland 2% milk → Wonder wheat bread, because the former may occur more frequently and thus have larger support than the latter.
Second, it is unlikely to find many strong association rules at a primitive concept level, such as the associations among particular bar codes, because of the tiny average support for each primitive data item in a very large item set. However, mining association rules at a high concept level may often lead to rules corresponding to prior knowledge and expectations (Klemettinen et al., 1994), such as milk → bread, or to some uninteresting attribute combinations, such as toy → milk. In order to remove uninteresting rules generated in knowledge mining processes, researchers have proposed measurements to quantify the usefulness or interestingness of a rule (Piatetsky-Shapiro et al., 1994) or suggested putting a human in the loop and providing tools to allow human guidance (Borgida and Brachman, 1993). Nevertheless, automatic generation of relatively focused, informative association rules is obviously more efficient than first generating a large mixture of interesting and uninteresting rules. These observations lead us to examine methods for mining association rules at multiple concept levels, which may not only discover rules at different levels but also have high potential to find nontrivial, informative association rules because of their flexibility in focusing attention on different sets of data and applying different thresholds at different levels.
Most of the previous studies on mining multi-level association rules, such as Han and Fu (1999), adopt an Apriori approach, which requires a large number of operations for counting pattern supports in the database. In this study, we propose a top-down progressive deepening method that extends some existing algorithms for mining multi-level association rules. The method first finds large data items at the top-most level and then progressively deepens the mining process into their large descendants at lower concept levels. This new approach calls the PASCAL algorithm (Bastide et al., 2000) at each concept level to mine all frequent patterns of that level. By using the concept of key patterns, it reduces the number of database passes at each concept level.
Definitions and Concept
We assume that the database contains: 1) an item data set which contains the description of each item in I in the form of (Ai, description), where Ai ∈ I and 2) a transaction data set consisting of transactions, each of which is a set of items from I together with a transaction identifier.
Definition 1
A pattern A is one item Ai or a set of conjunctive items Ai ∧ … ∧ Aj, where Ai, …, Aj ∈ I. The support of a pattern A in a set S, σ(A/S), is the number of transactions (in S) which contain A versus the total number of transactions in S. The confidence of A → B in S, φ(A → B/S), is the ratio of σ(A ∧ B/S) versus σ(A/S), i.e., the probability that pattern B occurs in S when pattern A occurs in S.
To find relatively frequently occurring patterns and reasonably strong rule implications, a user or an expert may specify two thresholds: minimum support, σ, and minimum confidence, φ. Notice that for finding multiple-level association rules, different minimum supports and/or minimum confidences can be specified at different levels.
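Definition 1 can be illustrated with a short Python sketch. The item names and the four transactions below are hypothetical, chosen only to exercise the two measures:

```python
def support(pattern, transactions):
    """sigma(A/S): fraction of transactions in S that contain every item of A."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t)) / len(transactions)

def confidence(a, b, transactions):
    """phi(A -> B/S) = sigma(A ^ B/S) / sigma(A/S)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]
print(support({"milk"}, transactions))                # 0.75
print(confidence({"milk"}, {"bread"}, transactions))  # 0.5 / 0.75
```

Here milk appears in three of four transactions (support 0.75) and bread co-occurs with milk in two of those three, giving a confidence of 2/3 for milk → bread.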
Definition 2
A pattern A is large in set S at level l if the support of A is no less than its corresponding minimum support threshold σl. A rule A → B/S is strong if, for a set S, each ancestor (i.e., the corresponding high-level item) of every item in A and B, if any, is frequent at its corresponding level, A ∧ B/S is frequent (at the current level) and the confidence of A → B/S is no less than the minimum confidence threshold at the current level.
The definition implies a filtering process which confines the patterns to be examined at lower levels to be only those with large support at their corresponding high levels. Based on this definition, the idea of mining multiple-level association rules is illustrated below.
Example 1
Let the query be to find multiple-level strong associations in the database in Table 1 for the purchase patterns related to the category, content and brand of foods which can only be stored for less than three weeks.
To answer the above query, the relevant part of the sales_item description relation in Table 2 is fetched and generalized into a generalized sales_item description table, as shown in Table 3, in which each tuple represents a generalized item that is the merger of a group of tuples sharing the same values in the interested attributes. For example, the tuples with the same category, content and brand in Table 1 are merged into one, with their bar codes replaced by a bar-code set. Each group is then treated as an atomic item in the generation of the lowest-level association rules. For example, the association rules generated regarding milk will be in relevance only to (at the low concept levels) brand (such as Dairyland) and content (such as 2%) but not to size, producer, etc.
The taxonomy information is provided implicitly in Table 3. Let category (such as milk) represent the first-level concept, content (such as 2%) for the second level one and brand (such as Foremost) for the third level one. The table implies a concept tree like Fig. 1.
The process of mining association rules is expected to first discover large patterns and strong association rules at the top-most concept level. Let the minimum support at this level be 5% and the minimum confidence be 50%. One may find the following: a set of single large items (each called a large 1-itemset, with the support ratio in parentheses): bread (25%), meat (10%), milk (20%), vegetable (30%); a set of pairwise large items (each called a large 2-itemset): {(vegetable, bread) (19%), …}, etc.; and a set of strong association rules, such as bread → vegetable (76%), ….
Table 1: A sales transaction table
Table 2: A sales_item (description) relation
Table 3: A generalized sales_item description table
Fig. 1: A taxonomy of the relevant data items
At the second level, only the transactions which contain the large items of the first level are examined. Let the minimum support at this level be 2% and the minimum confidence be 40%. One may find the frequent 1-itemsets: lettuce (10%), wheat bread (15%), …; the large 2-itemsets: {(2% milk, wheat bread) (6%), …}; and the strong association rules: 2% milk → wheat bread (60%), …, etc.
The process repeats at even lower concept levels until no large patterns can be found.
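The level-by-level process of Example 1 can be sketched in Python. This is a simplified illustration rather than the paper's algorithm: items are assumed to be encoded as strings whose length-l prefix identifies their level-l concept, only 1-itemsets are mined and the data and per-level thresholds are invented:

```python
from collections import Counter

def progressive_deepening(transactions, depth, min_supports):
    """Mine frequent 1-itemsets level by level; at level l, drop items whose
    level-l concept missed the threshold and transactions that become empty,
    so deeper levels examine a progressively smaller table."""
    frequent = {}
    table = [set(t) for t in transactions]
    for l in range(1, depth + 1):
        n = len(table)
        # project each encoded item onto its level-l prefix, e.g. "112" -> "11"
        counts = Counter(p for t in table for p in {i[:l] for i in t})
        frequent[l] = {p for p in counts if counts[p] / n >= min_supports[l - 1]}
        # filtering step implied by Definition 2
        table = [{i for i in t if i[:l] in frequent[l]} for t in table]
        table = [t for t in table if t]
        if not table:
            break
    return frequent

levels = progressive_deepening(
    [{"11", "21"}, {"11"}, {"12"}, {"21"}], depth=2, min_supports=[0.5, 0.5]
)
print(levels[1] == {"1", "2"}, levels[2] == {"11", "21"})  # True True
```

In this toy run both level-1 concepts survive the 50% threshold, while at level 2 the item "12" (support 25%) is filtered out.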
A New Method for Mining Multi-Level Association Rules
A method for mining multiple-level association rules is introduced in this section, which uses a hierarchy-information-encoded transaction table instead of the original transaction table (Han and Fu, 1999). This is based on the following considerations. First, a data mining query is usually relevant to only a portion of the transaction database, such as food, instead of all the items. It is beneficial to first collect the relevant set of data and then work repeatedly on the task-relevant set. Second, encoding can be performed during the collection of task-relevant data, so no extra encoding pass is required. Third, an encoding string, which represents a position in a hierarchy, requires fewer bits than the corresponding object identifier or bar code. Moreover, encoding enables more items to be merged or removed due to their identical encoding, which further reduces the size of the encoded transaction table. Thus it is often beneficial to use an encoded table, although our method does not rely on the derivation of such an encoded table because the encoding can always be performed on the fly.
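As an illustration of such an encoding, assume each character of the code is the branch taken at one taxonomy level; the concrete branch codes below are hypothetical, chosen to match the milk/bread taxonomy of Example 1:

```python
# Hypothetical branch codes at each taxonomy level (assumed for illustration).
category = {"milk": "1", "bread": "2"}
content = {"2%": "1", "chocolate": "2", "white": "1", "wheat": "2"}
brand = {"Dairyland": "1", "Foremost": "2", "Wonder": "1", "Old Mills": "2"}

def encode(cat, cont, br):
    """Concatenate branch codes; every ancestor is recoverable as a prefix,
    and the short string needs fewer bits than a full bar code."""
    return category[cat] + content[cont] + brand[br]

code = encode("milk", "2%", "Foremost")
print(code)      # 112
print(code[:2])  # 11 -- the level-2 ancestor, 2% milk
```

Because two items with identical codes are indistinguishable at every level, they can be merged in the encoded table, which is the size reduction noted above.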
This new method uses the counting inference approach (Bastide et al., 2000) at each level rather than Apriori. It allows performing as few support counts as possible. Using this method, the support of a pattern is determined without accessing the database whenever possible, using the supports of some of its sub-patterns called key patterns. For the implementation of this method we used the PASCAL algorithm (Bastide et al., 2000), which is an optimization of the Apriori algorithm. In PASCAL, frequent patterns are extracted in a levelwise manner: during each iteration, candidate patterns of size k are created by joining the frequent patterns of size k-1, their supports are determined and infrequent ones are discarded. Using counting inference, if a candidate pattern of size k is a non-key pattern, then its support is equal to the minimal support among the patterns of size k-1 that are its subsets. This reduces the number of patterns considered during each database pass while counting supports and, more importantly, reduces the total number of passes. This optimization is valid because key patterns have a property that is compatible with the pruning of Apriori: all subsets of a key pattern are key patterns and all supersets of a non-key pattern are non-key patterns.
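The inference step for a non-key candidate can be written down directly; the pattern and the support values below are invented for illustration:

```python
from itertools import combinations

def infer_support(pattern, subset_supports):
    """Support of a non-key pattern of size k: the minimum support among its
    subsets of size k-1 (counting inference, Bastide et al., 2000)."""
    k = len(pattern)
    return min(subset_supports[frozenset(s)]
               for s in combinations(sorted(pattern), k - 1))

# Toy supports of the size-2 subsets of {A, B, C}; if {A, B, C} is known
# to be non-key, its support requires no database pass:
subs = {
    frozenset({"A", "B"}): 0.30,
    frozenset({"A", "C"}): 0.25,
    frozenset({"B", "C"}): 0.25,
}
print(infer_support({"A", "B", "C"}, subs))  # 0.25
```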
The above discussion leads to the following algorithm for mining strong multi-level association rules.
Algorithm 1
Find multiple-level large item sets for mining strong ML association rules
in a transaction database.
Input
(1)
Output
Multi-level large item sets.
Method
A top-down, progressively deepening process which collects large item sets
at different concept levels as follows.
Starting at level 1, derive for each level l the large k-itemsets.
Here the PASCAL procedure is called to generate all frequent k-itemsets (k ≥ 2) at each level l. The pseudo-code for the PASCAL procedure is as follows.
Procedure
PASCAL
Input
(1)
Output
All frequent k-itemsets at level l.
The algorithm starts with the empty set, which always has a support of 1 and which is by definition (Bastide et al., 2000) a key pattern (steps 1 and 2). In steps 3 to 5, frequent 1-patterns are marked as key patterns unless their support is 1. The main loop is similar to the one in Apriori (steps 6 to 20). First, Pascal-Gen is called to compute the candidate patterns. The support of the key ones is determined via a database pass (steps 9 to 13). Then (steps 14 to 19) the traditional pruning (step 15) is done. At the same time, for all remaining candidate key patterns, it is determined whether they are key or not (steps 16 and 17).
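Assuming the whole transaction set fits in memory, the steps above can be condensed into the following Python sketch. It is a simplified reading of PASCAL, not its exact pseudo-code: candidate generation is done naively over all item combinations for brevity:

```python
from itertools import combinations

def pascal(transactions, min_sup):
    """Levelwise mining with counting inference: only key candidates are
    counted against the data; non-key candidates inherit the minimal
    support of their subsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Frequent 1-patterns; a 1-pattern is key unless its support is 1.
    counts = {}
    for t in transactions:
        for item in t:
            p = frozenset([item])
            counts[p] = counts.get(p, 0) + 1
    level = {p: c / n for p, c in counts.items() if c / n >= min_sup}
    key = {p: s < 1.0 for p, s in level.items()}
    all_freq = dict(level)
    while level:
        k = len(next(iter(level)))
        items = sorted({i for p in level for i in p})
        next_level = {}
        for c in (frozenset(c) for c in combinations(items, k + 1)):
            subs = [frozenset(s) for s in combinations(sorted(c), k)]
            if not all(s in level for s in subs):   # Apriori-style pruning
                continue
            pred = min(level[s] for s in subs)      # minimal subset support
            is_key = all(key[s] for s in subs)      # key-flag inheritance
            if is_key:
                # database pass only for key candidates
                sup = sum(1 for t in transactions if c <= t) / n
                is_key = sup < pred                 # still key after counting?
            else:
                sup = pred                          # counting inference
            if sup >= min_sup:
                next_level[c] = sup
                key[c] = is_key
        all_freq.update(next_level)
        level = next_level
    return all_freq

freq = pascal([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5)
print(len(freq))                    # 6 frequent patterns
print(freq[frozenset({"a", "b"})])  # 0.5
```

On this toy data the three items (support 0.75 each) and the three pairs (support 0.5 each) are frequent, while {a, b, c} (support 0.25) is discarded.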
Explanation of Algorithm 1
According to Algorithm 1, the discovery of large-support items at each level l proceeds as follows. At level 1 (l = 1), the large 1-itemsets are derived first. For k > 1, the itemset table at level l is derived as done in the PASCAL algorithm (Bastide et al., 2000), i.e., the candidate set is first computed from the large (k-1)-itemsets of that level. The large itemsets at level l are then obtained by checking the candidates' supports against the minimum support threshold of that level.
Generation of Candidate Sets
In this section the Pascal-Gen procedure is discussed, which is used by the PASCAL algorithm for generating candidate sets at each concept level. The pseudo-code of this procedure is as follows.
Procedure
Pascal-Gen: the candidate generation procedure of the counting inference algorithm (Bastide et al., 2000), given below.
Input
Lk, the set of frequent k-patterns p with their support p.supp
and the p.key flag.
Output
Ck, the set of candidate (k+1)-patterns c, each with the flag c.key, the value c.pred_supp and the support c.supp if c is not a key pattern.
The way Pascal-Gen operates is basically known from the generator function Apriori-Gen that was introduced by Agrawal and Srikant (1994). In addition to Apriori-Gen's join and prune steps, Pascal-Gen makes the new candidates inherit whether or not they are candidate key patterns (step 16), using Theorem 2 of Bastide et al. (2000), and at the same time determines the support of all non-key candidate patterns (step 19), using Theorem 4 of Bastide et al. (2000).
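Restricted to the data structures named in the Input/Output description above (p.supp, p.key, c.pred_supp), a sketch of the candidate generation step might look as follows; the naive join over all item combinations is an assumed simplification of the real join step:

```python
from itertools import combinations

def pascal_gen(Lk):
    """Sketch of Pascal-Gen: Lk maps each frequent k-pattern to (supp, key).
    Returns candidates c mapped to (key, pred_supp, supp), where supp is
    already known (inferred) whenever c is a non-key candidate."""
    if not Lk:
        return {}
    k = len(next(iter(Lk)))
    items = sorted({i for p in Lk for i in p})
    Ck = {}
    for c in (frozenset(c) for c in combinations(items, k + 1)):
        subs = [frozenset(s) for s in combinations(sorted(c), k)]
        if not all(s in Lk for s in subs):      # prune step
            continue
        pred_supp = min(Lk[s][0] for s in subs)
        c_key = all(Lk[s][1] for s in subs)     # non-key subsets propagate up
        supp = None if c_key else pred_supp     # inference for non-key c
        Ck[c] = (c_key, pred_supp, supp)
    return Ck

L2 = {
    frozenset({"a", "b"}): (0.5, True),
    frozenset({"a", "c"}): (0.5, True),
    frozenset({"b", "c"}): (0.5, False),  # non-key: equals a subset's support
}
C3 = pascal_gen(L2)
print(C3[frozenset({"a", "b", "c"})])  # (False, 0.5, 0.5)
```

Because one subset of {a, b, c} is non-key, the candidate is flagged non-key and its support is filled in without a database pass; only key candidates are left for the counting pass of the main loop.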
Conclusions
We have extended the scope of the study of mining association rules from concepts at a single level of a hierarchy to concepts at multiple levels of a hierarchy and developed a new method for mining multi-level association rules from large transaction databases. This efficient top-down progressive deepening technique incorporates the pattern counting inference approach, which is based on the notion of key patterns of equivalence classes of patterns. At each level it counts from the dataset the supports of only some frequent patterns, the frequent key patterns, rather than the supports of all frequent patterns. Because of this, it reduces the number of database passes at each concept level. When only a small portion of the data consists of large items at each level, it saves a substantial amount of processing. Thus it is a promising algorithm in these circumstances.