Research Article
 

MAXFP: A Multi-strategy Algorithm for Mining Maximum Frequent Patterns and Their Support Counts



R.S. Thakur, R.C. Jain and K.R. Pardasani
 
ABSTRACT

The problem of efficiency is the main crux of most data mining problems, such as mining frequent patterns, and is chiefly concerned with the number of operations required for counting pattern supports in a large database. In this study we propose a new multi-strategy algorithm, which combines the Pincer-Search and counting inference approaches to discover maximum frequent patterns along with the support counts of all their subsets. The algorithm works in both directions, bottom-up as well as top-down: the main search direction is still bottom-up, but a restricted search is conducted in the top-down direction. An important characteristic of the algorithm is that it need not explicitly examine every frequent itemset. Counting inference allows it to perform as few support counts as possible: whenever possible, the support of a pattern is determined without accessing the database, using the supports of some of its sub-patterns called key patterns. The MAXFP method performs well even when some maximal frequent itemsets are long. It reduces the cost of the frequent itemset discovery process, that is, it minimizes support count operations as well as database scans, and it counts the supports of all frequent patterns generated by the maximum frequent patterns without accessing the database.


 
  How to cite this article:

R.S. Thakur, R.C. Jain and K.R. Pardasani, 2006. MAXFP: A Multi-strategy Algorithm for Mining Maximum Frequent Patterns and Their Support Counts. Trends in Applied Sciences Research, 1: 402-415.

DOI: 10.3923/tasr.2006.402.415

URL: https://scialert.net/abstract/?doi=tasr.2006.402.415

Introduction

Association rule mining finds interesting associations among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. The discovery of interesting association relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalogue design, cross-marketing and loss-leader analysis.

The problem of mining frequent patterns first arose as a sub-problem of mining association rules (Agrawal and Srikant, 1994). A frequent pattern is a set of binary attributes (items) whose support, i.e., the number of objects in the database containing it, is at least equal to a minimum support threshold minsup defined by the user. It was then discovered that frequent patterns are involved in a variety of problems (Han et al., 2000): mining sequential patterns (Agrawal and Srikant, 1995), episodes (Mannila et al., 1997), correlations (Brin et al., 1997; Silverstein et al., 1998), multi-dimensional patterns (Kamber et al., 1997; Lent et al., 1997), maximal patterns (Lin and Kedem, 1998; Bayardo, 1998; Zaki et al., 1997) and closed patterns (Pasquier et al., 1999a; Pei et al., 2000; Taouil et al., 2000).

Typical algorithms for finding the frequent pattern, i.e., the set of all frequent itemsets (Agrawal and Srikant, 1994; Mannila et al., 1994), operate in a bottom-up, breadth-first fashion. In other words, the computation starts from frequent 1-itemsets (minimal-length frequent itemsets at the bottom) and then extends one level up in every pass until all maximal (length) frequent itemsets are discovered. All frequent itemsets are explicitly examined and discovered by these algorithms. When all maximal frequent itemsets are small, these algorithms perform reasonably well. However, performance drastically decreases when any of the maximal frequent itemsets becomes larger. This is due to the fact that a maximal frequent itemset of size l implies the presence of 2^l − 2 non-trivial frequent itemsets (its non-trivial subsets) as well, each of which is explicitly examined by such algorithms.

Of course, the set of the maximal (length) frequent itemsets (called the maximum frequent pattern) uniquely determines the entire set of frequent itemsets: the frequent itemsets are precisely the non-empty subsets of its elements. Thus, trivially, discovering the maximum frequent pattern implies immediate discovery of all the maximal frequent itemsets. In many situations, it suffices to know the supports of the maximal frequent itemsets and of all their subsets. For such situations, the MAXFP (Maximum Frequent Pattern) algorithm provides an efficient solution.

In this study we propose the Fast MAXFP algorithm to find the maximal frequent itemsets as well as the supports of all their subsets. This study covers not only the process of finding maximum frequent patterns but also the supports of all frequent itemsets that are subsets of the maximal frequent itemsets discovered by the MAXFP algorithm. This new concept also reduces the number of database passes compared to other approaches (Lin and Kedem, 1998).

The Fast MAXFP algorithm is based on Pincer-Search (Lin and Kedem, 1998) and the counting inference approach (Bastide et al., 2000). As in Pincer-Search (Lin and Kedem, 1998), our approach uses the MFP (maximum frequent pattern) and searches for the MFP from both the bottom-up and top-down directions. The Fast MAXFP algorithm performs well even when the maximal frequent itemsets are long.

In the Fast MAXFP algorithm, the bottom-up search follows the pattern counting inference method presented in Bastide et al. (2000), where the Pascal-Gen algorithm was introduced, which minimizes as much as possible the number of pattern support counts performed when extracting frequent patterns. This method relies on the concept of key patterns, where a key pattern is a minimal (with respect to set inclusion) pattern of an equivalence class gathering all patterns common to the same objects of the database relation. Hence, all patterns in an equivalence class have the same support and the supports of the non-key patterns of an equivalence class can be determined using the supports of the key patterns of this class. With pattern counting inference, only the supports of the frequent key patterns (and some infrequent ones) are determined from the database, while the supports of the frequent non-key patterns are derived from those of the frequent key patterns. In this approach, as in Apriori (Agrawal and Srikant, 1994), frequent patterns are extracted in a levelwise manner: during each iteration, candidate patterns of size k are created by joining the frequent patterns of size k-1, their supports are determined and infrequent ones are discarded. Using counting inference, if a candidate pattern of size k is a non-key pattern, then its support is equal to the minimal support among the patterns of size k-1 that are its subsets. This reduces the number of patterns considered during each database pass while counting supports and, even more important, reduces the total number of passes. This optimization is valid since key patterns have a property that is compatible with the pruning of Apriori: all subsets of a key pattern are key patterns and all supersets of a non-key pattern are non-key patterns.

The top-down search is implemented efficiently by introducing a set that we call the maximum-frequent-candidate-pattern or MFCP. This set consists of the minimum number of itemsets such that the union of all their subsets contains all the known (discovered so far) frequent itemsets but none of the known infrequent itemsets. The known frequent and infrequent itemsets are those determined by the supports of the itemsets that have been discovered up to the most recent pass. Thus, obviously, at any point of the algorithm MFCP is a superset of MFP. When the algorithm terminates, MFCP and MFP are equal. Unlike the bottom-up search, which goes up one level in each pass, the MFCP set can “move down” many levels in one pass.

In the present study, we use the MFCP idea of Lin and Kedem (1998) along with counting inference (Bastide et al., 2000) to discover frequent itemsets in market basket data. In most cases, our algorithm not only reduces the number of passes over the database but also reduces the number of candidates (whose supports are counted). In such cases, both I/O time and CPU time are reduced by: i) eliminating the candidates that are subsets of maximal frequent itemsets found in MFCP and ii) using Pascal-Gen (Bastide et al., 2000) to determine as many support counts as possible without accessing the database, from information gathered in previous passes.

The Problem of Mining Frequent Patterns
Definition 1
Let P be a finite set of items, O a finite set of objects (e.g., transaction ids) and R ⊆ O × P a binary relation (where (o, p) ∈ R may be read as “item p is included in transaction o”). The triple D = (O, P, R) is called a dataset.

Each subset P of P is called a pattern. We say that a pattern P is included in an object o ∈ O if (o, p) ∈ R for all p ∈ P. Let f be the function which assigns to each pattern P ⊆ P the set of all objects which include this pattern: f(P) = {o ∈ O | o includes P }.

The support of a pattern P is given by supp(P) = card(f(P))/card(O). For a given threshold minsup ∈ (0,1), a pattern P is called a frequent pattern (or frequent itemset) if supp(P) ≥ minsup.
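This definition translates directly into code. The following is a minimal Python sketch using a made-up toy dataset (not the paper's Fig. 4 data); the `supp` helper and the dict-of-sets representation are our own illustrative choices:

```python
from fractions import Fraction

def supp(pattern, dataset):
    """supp(P) = card(f(P)) / card(O): the fraction of objects o
    whose itemset includes every item of the pattern P."""
    covering = [o for o, items in dataset.items() if pattern <= items]
    return Fraction(len(covering), len(dataset))

# toy dataset D = (O, P, R): object id -> set of items it contains
D = {1: {1, 2, 3}, 2: {2, 3}, 3: {1, 2}, 4: {2, 3, 4}}

print(supp(frozenset({2, 3}), D))   # 3 of the 4 objects contain {2,3}
```

With, say, minsup = 1/2, the pattern {2,3} would be frequent here, since 3/4 ≥ 1/2.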

The problem of association rule mining consists of two stages (Agrawal and Srikant, 1994): i) the discovery of frequent itemsets, followed by ii) the generation of association rules.

The maximum frequent pattern (MFP) is the set of all the maximal frequent itemsets. (An itemset is a maximal frequent itemset if it is frequent and no proper superset of it is frequent.) Obviously, an itemset is frequent if and only if it is a subset of a maximal frequent itemset. Thus, it is necessary to discover only the maximum frequent pattern during the first stage. Of course, an algorithm for that stage may explicitly discover and store some other frequent itemsets as a necessary part of its execution, but minimizing such effort may increase efficiency. If the maximum frequent pattern is known, one can easily generate the required subsets and count their supports by reading the database once, which is quite straightforward.

It follows that, in the vast majority of cases, the discovery of the maximum frequent pattern dominates the performance of the whole process. Therefore, we first find the maximum frequent patterns and then calculate the supports of all frequent patterns.

Key Features of the Counting Inference Approach
Here, we give the theoretical basis of the counting inference approach as in Bastide et al. (2000). Like Apriori, counting inference traverses the powerset of P levelwise: at the kth iteration, the algorithm first generates all candidate k-patterns.

Definition 2
A k-pattern P is a subset of P with card(P) = k. A candidate k-pattern is a k-pattern where all its proper sub-patterns are frequent.

For the candidate k-patterns one database pass is used to determine their support. Then infrequent patterns are pruned.

Key Patterns
The counting inference approach is based on the observation that patterns can be considered equivalent if they are included in exactly the same objects. We describe this fact by the following equivalence relation θ on patterns.

Definition 3
For patterns P, Q ⊆ P, we let P θ Q if and only if f(P) = f(Q). The set of patterns which are equivalent to a pattern P is given by [P] = {Q ⊆ P | P θ Q}.

In the case of patterns P and Q with P θ Q, both patterns obviously have the same support:

Lemma 1

  Let P and Q be two patterns.
  (i) P θ Q ⇒ supp(P) = supp(Q)
  (ii) P ⊆ Q ∧ supp(P) = supp(Q) ⇒ P θ Q

Proof

  (i) P θ Q ⇒ f(P) = f(Q) ⇒ supp(P) = card(f(P))/card(O) =
  card(f(Q))/card(O) = supp(Q).
  (ii) Since P ⊆ Q and f is monotonically decreasing, we have f(P) ⊇ f(Q).
  supp(P) = supp(Q) is equivalent to card(f(P)) = card(f(Q)), which together with the former implies f(P) = f(Q) and thus P θ Q.

Hence if we knew the relation θ in advance, we would need to count the support of only one pattern in each equivalence class. Of course we do not know the relation in advance, but we can construct it step by step. Thus, we will (in general) need to determine the support of more than one pattern in each class, but not of all of them. If we have already determined the support of a pattern P in the database and later pass a pattern Q ∈ [P], then we need not access the database for it because we know that supp(Q) = supp(P). The first patterns of an equivalence class that we reach using a levelwise approach are exactly the minimal patterns in the class:

Definition 4
A pattern P is a key pattern if it is minimal (with respect to set inclusion) in its equivalence class, i.e., P ∈ min [P]. A candidate key pattern is a pattern all of whose proper sub-patterns are frequent key patterns.

Observe that all candidate key patterns are also candidate patterns.

Pattern Counting Inference
In the algorithm, we apply the pruning strategy to both candidate patterns and candidate key patterns. This is justified by the following theorem as in (Bastide et al., 2000).

Theorem 1

  (i) If Q is a key pattern and P ⊆ Q, then P is also a key pattern.
  (ii) If P is not a key pattern and P ⊆ Q, then Q is not a key pattern either.

Proof
Let P ⊆ Q and assume P is not a key pattern. Then ∃P’ ∈ min [P] with P’ ⊂ P. From f(P’) = f(P) it follows that f(Q) = f(Q \ (P \ P’)). Hence Q is not minimal in [Q] and thus, by definition, not a key pattern. This proves (ii); (i) is a direct logical consequence of (ii).

The algorithm determines, at each iteration, the key patterns among the candidate key patterns by using (ii) of the following theorem:

Theorem 2

  Let P be a pattern.
  (i) Let p ∈ P. Then P ∈ [P \ {p}] if and only if supp(P) = supp(P \ {p}).
  (ii) P is a key pattern if and only if supp(P) ≠ min_{p ∈ P} (supp(P \ {p})).

Proof
(i) The “if” part follows from Lemma 1(ii); the “only if” part is obvious.
(ii) From (i) we deduce that P is a key pattern if and only if supp(P) ≠ supp(P \ {p}) for all p ∈ P. Since supp is a monotonically decreasing function, this is equivalent to (ii).

Since all candidate key patterns are also candidate patterns, when generating all candidate patterns for the next level we can at the same time determine the candidate key patterns among them. If we reach a candidate k-pattern which is not a candidate key pattern, then we already passed along at least one of the key patterns in its equivalence class in an earlier iteration. Hence we already know its support. Using the following theorem, we determine this support without accessing the database:

Theorem 3

  If P is a non-key pattern, then
  supp(P) = min_{p ∈ P} (supp(P \ {p})).

Proof
“≤” follows from the fact that supp is a monotonically decreasing function. “≥”: If P is not a key pattern, then there exists p ∈ P with P θ P \ {p}. Hence supp(P) = supp(P \ {p}) ≥ min_{q ∈ P} (supp(P \ {q})).

Thus the database pass needs to count the supports of the candidate key patterns only.
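Theorem 3 can be checked numerically. The following Python sketch uses a small made-up transaction list (illustrative only, not the paper's Fig. 4 dataset) and brute-force support counting:

```python
# toy database: list of transactions (illustrative only)
D = [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def supp(P):
    """Brute-force support: fraction of transactions containing P."""
    return sum(1 for t in D if P <= t) / len(D)

# {2,5} occurs in exactly the same transactions as {2} alone, so it is
# a non-key pattern and Theorem 3 applies: its support is the minimum
# support among its maximal proper subsets.
P = frozenset({2, 5})
pred = min(supp(P - {p}) for p in P)
assert supp(P) == pred              # no database pass would be needed

# {1,3} is a key pattern here: its support differs from that minimum
# (Theorem 2(ii)), so it would have to be counted from the database.
Q = frozenset({1, 3})
print(supp(Q), min(supp(Q - {q}) for q in Q))   # 0.25 vs 0.5
```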

A Fast MAXFP Algorithm for Discovering the Maximum Frequent Pattern and Support Count
We propose a Fast Multi-Strategy algorithm which combines top-down and bottom-up search. It relies on a new data structure during its execution, the maximum-frequent-candidate-pattern (MFCP), which is defined below.

Definition 5
Consider some point during the execution of an algorithm for finding MFP. Some itemsets are frequent, some infrequent and some unclassified. The maximum-frequent-candidate-pattern (MFCP) is a minimum cardinality set of itemsets such that the union of all the subsets of its elements contains all the frequent itemsets but does not contain any infrequent itemsets, that is, it is a minimum cardinality set satisfying the conditions

∀X ∈ FREQUENT: ∃Y ∈ MFCP such that X ⊆ Y, and
∀X ∈ INFREQUENT: there is no Y ∈ MFCP such that X ⊆ Y,

where FREQUENT and INFREQUENT stand, respectively, for the sets of all frequent and all infrequent itemsets.

Thus obviously at any point of the algorithm MFCP is a superset of MFP. When the algorithm terminates, MFCP and MFP are equal. The computation of our algorithm follows the bottom-up search approach. In each pass, in addition to counting supports of the candidates in the bottom-up direction by Pascal-Gen (Bastide et al., 2000), the algorithm also counts supports of the itemsets in MFCP: this set is adapted for the top-down search. This will help in pruning candidates, but will also require changes in candidate generation.

Consider now some pass k, during which itemsets of size k are to be classified. If some itemset that is an element of MFCP, say X, of cardinality greater than k, is found to be frequent in this pass, then all its subsets must be frequent. Therefore, all of its subsets of cardinality k can be pruned from the set of candidates considered in the bottom-up direction in this pass. These itemsets and their supersets will never be candidates throughout the rest of the execution, potentially improving performance. But of course, as the maximum frequent pattern is finally computed, they will not be forgotten.

Similarly, when a new infrequent itemset is found in the bottom-up direction, the algorithm will use it to update MFCP. The subsets of MFCP must not contain this infrequent itemset. By using the MFCP, we will be able to discover some maximal frequent itemsets in early passes. This is especially significant when the maximal frequent itemsets discovered in the early passes are long.

MFCP Updating Algorithm
Consider some itemset Y that has just been classified as infrequent and assume that it is a subset of some itemset X that is an element of MFCP. To update MFCP, we replace X by |Y| itemsets, each obtained by removing from X a single item (element) of Y. We do this for each newly discovered infrequent itemset and each of its supersets that is an element of MFCP. Formally, we have the following MFCP-Gen algorithm (shown here for pass k).

Algorithm
MFCP-Gen (from Lin and Kedem, 1998)

Input
Old MFCP and the infrequent set Sk discovered in pass k

Output

New MFCP

1. for all itemsets s in Sk
2. for all itemsets m in MFCP
3. if s is a subset of m
4. MFCP := MFCP \ {m}
5. for all items e in s
6. if m \ {e} is not a subset of any itemset in MFCP
7. MFCP := MFCP ∪ {m \ {e}}
8. return MFCP

Example

Suppose {{1,2,3,4,5,6}} is the current (old) value of MFCP and two new infrequent itemsets {1,6} and {3,6} are discovered. Consider first the infrequent itemset {1,6}. Since the itemset {1,2,3,4,5,6} (an element of MFCP) contains items 1 and 6, one of its subsets will be {1,6}. By removing item 1 from itemset {1,2,3,4,5,6}, we get {2,3,4,5,6} and by removing item 6 from itemset {1,2,3,4,5,6} we get {1,2,3,4,5}. After considering itemset {1,6}, MFCP becomes {{1,2,3,4,5}, {2,3,4,5,6}}. Itemset {3,6} is then used to update this MFCP. Since {3,6} is a subset of {2,3,4,5,6}, two itemsets {2,3,4,5} and {2,4,5,6} are generated to replace {2,3,4,5,6}. The itemset {2,3,4,5} is a subset of the itemset {1,2,3,4,5} in the new MFCP and will be removed from MFCP. Therefore, MFCP becomes {{1,2,3,4,5}, {2,4,5,6}}. The top-down arrows in Fig. 1 show the updates of MFCP.
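The update rule above can be sketched in Python (a hypothetical `mfcp_gen` helper written from the MFCP-Gen description, not the paper's own listing); run on this example it reproduces the final MFCP:

```python
def mfcp_gen(mfcp, infrequent_sets):
    """Replace every MFCP element m containing an infrequent itemset y
    by the |y| itemsets m \\ {e}, e in y, keeping only maximal results."""
    mfcp = [frozenset(m) for m in mfcp]
    for y in map(frozenset, infrequent_sets):
        for m in [m for m in mfcp if y <= m]:
            mfcp.remove(m)
            for e in y:
                cand = m - {e}
                # keep cand only if no remaining element already subsumes it
                if not any(cand <= other for other in mfcp):
                    mfcp.append(cand)
    return mfcp

result = mfcp_gen([{1, 2, 3, 4, 5, 6}], [{1, 6}, {3, 6}])
print(sorted(sorted(m) for m in result))   # [[1, 2, 3, 4, 5], [2, 4, 5, 6]]
```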

Fig. 1: Two-way search with the counting inference approach. *Bold style represents candidates that are frequent, *Underline style represents candidates that are infrequent, *Italic style represents candidates that are pruned by using MFCP, *Strikethrough itemsets represent candidates whose supports are counted but which will not be used to generate new candidates, *Itemsets in ellipses are the maximal frequent itemsets, *Itemsets within [] are key patterns

A New Candidate Generation Method
Our method follows the counting inference approach in the bottom-up direction for generating candidate itemsets. For this purpose it uses the Pascal-Gen algorithm.

Algorithm
Pascal-Gen (from Bastide et al., 2000), the join procedure of the counting inference algorithm

Input
Lk, the set of frequent k-patterns p with their support p.supp and the p.key flag

Output
Ck+1, the set of candidate (k+1)-patterns c, each with the flag c.key, the value c.pred_supp and the support c.supp if c is not a key pattern


The way Pascal-Gen operates is basically known from the generator function Apriori-Gen, which was introduced in Agrawal and Srikant (1994). In addition to Apriori-Gen’s join and prune steps, Pascal-Gen makes each new candidate inherit the fact of being or not being a candidate key pattern (by using Theorem 1) and at the same time determines the support of all non-key candidate patterns (by using Theorem 3).
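This behavior can be sketched in Python (our own illustrative representation: patterns as sorted tuples, bookkeeping in dicts; this is a sketch of the join with key inheritance, not the paper's listing):

```python
def pascal_gen(Lk, k):
    """Join frequent k-patterns sharing a (k-1)-prefix (as in
    Apriori-Gen) and prune; then a candidate is a key pattern only if
    all its k-subsets are key (Theorem 1), and a non-key candidate's
    support is the minimum of its subsets' supports (Theorem 3)."""
    Ck1 = {}
    pats = sorted(Lk)                        # sorted tuples of items
    for i in range(len(pats)):
        for j in range(i + 1, len(pats)):
            a, b = pats[i], pats[j]
            if a[:k-1] != b[:k-1]:
                continue                     # join step: shared prefix only
            c = tuple(sorted(set(a) | set(b)))
            subsets = [tuple(x for x in c if x != e) for e in c]
            if not all(s in Lk for s in subsets):
                continue                     # prune step
            pred_supp = min(Lk[s]['supp'] for s in subsets)
            is_key = all(Lk[s]['key'] for s in subsets)   # Theorem 1
            Ck1[c] = {'key': is_key, 'pred_supp': pred_supp}
            if not is_key:                   # Theorem 3: no DB pass needed
                Ck1[c]['supp'] = pred_supp
    return Ck1

L2 = {(1, 2): {'supp': 3, 'key': True},
      (1, 3): {'supp': 3, 'key': True},
      (2, 3): {'supp': 3, 'key': False}}
print(pascal_gen(L2, 2))   # (1,2,3) inherits non-key status and supp 3
```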

The working of Pascal-Gen is shown by an example. For our example we use the dataset in Fig. 4. The algorithm first performs one database pass to count the supports of the 1-patterns (Fig. 2). As {5} has the same support as the empty set, {5} is marked as a non-key pattern and Pascal-Gen is called to generate candidate patterns.

At the next iteration, all candidate 2-patterns are created and stored in C2 (Fig. 3). At the same time, the support of every pattern containing {5} as a sub-pattern is computed. Then a database pass is performed to determine the supports of the remaining ten candidate patterns.

Similarly, at the third iteration it turns out in Pascal-Gen that each newly generated candidate pattern contains at least one sub-pattern which is not a key pattern, so their supports are determined directly in Pascal-Gen. From there, the database is accessed for only one candidate pattern. In the fourth and fifth iterations (if needed), all supports are determined directly in Pascal-Gen.

In our algorithm, simultaneously with Pascal-Gen, MFCP is updated at each pass and, after a maximal frequent itemset is added to MFP, all of its subsets in the frequent set (as computed so far) are removed. We show by example that if only the Pascal-Gen join procedure is applied, some of the needed itemsets could be missing from the candidate set. Consider Fig. 1. Suppose the original frequent set L3 is {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {2,4,6}, {2,5,6}, {3,4,5}, {4,5,6}. Assume itemset {1,2,3,4,5} in MFCP is determined to be frequent. Then all 3-itemsets of the original frequent set L3 will be removed from it by our algorithm, except for {2,4,6}, {2,5,6} and {4,5,6}. Since the join procedure uses a (k-1)-prefix test on the frequent set to generate new candidates and no two itemsets in the current frequent set {2,4,6}, {2,5,6}, {4,5,6} share a 2-prefix, no candidate will be generated by applying the join procedure on this frequent set. However, the correct candidate set should be {2,4,5,6}.

The full set of required candidates can be obtained by restoring some itemsets to the current frequent set. They will be extracted from MFCP, which implicitly maintains all frequent itemsets discovered so far. The itemsets that need to be restored are precisely those k-itemsets that have the same (k-1)-prefix as an itemset in the current frequent set.

Consider then, in pass k, an itemset X in MFCP and an itemset Y in the current frequent set such that |X| > k. Suppose that the first k-1 items of Y are in X and the (k-1)-th item of Y is equal to the j-th item of X. We determine the k-subsets of X that have the same (k-1)-prefix as Y by taking one item of X that has an index greater than j and combining it with the first k-1 items of Y to get one of these k-subsets. After these k-itemsets are found, we generate candidates by combining them with the itemset Y. When new itemsets are recovered, we calculate their pred_supp, supp and key values. This is specified by the following recovery procedure.

Fig. 2: Frequent 1-Itemsets

Fig. 3: Candidate pattern C2 and Frequent 2-Itemsets L2

Fig. 4: Transaction Table

Algorithm
The recovery procedure

Input
Ck+1, Lk and the current MFP containing the maximal frequent itemsets discovered up to pass k

Output
A complete candidate set Ck+1

1. for all itemsets l in Lk
2. for all itemsets m in MFP
3. if the first k-1 items in l are also in m
4. /* suppose m.itemj = l.itemk-1 */
5. for i from j + 1 to |m|
6. Ck+1 := Ck+1 ∪ {{l.item1, l.item2, ..., l.itemk, m.itemi}}
7. forall newly discovered itemsets c ∈ Ck+1 do begin
8. c.key := true; c.pred_supp := +∞
9. forall k-subsets s of c do begin
10. c.pred_supp := min(c.pred_supp, s.supp)
11. if not s.key then c.key := false
12. end
13. if not c.key then c.supp := c.pred_supp
14. end
15. return Ck+1

Example
Fig. 1 shows that MFCP is {{1,2,3,4,5}} and the current frequent set is {{2,4,6}, {2,5,6}, {4,5,6}}. The only 3-subset of {1,2,3,4,5} that needs to be restored for the itemset {2,4,6} to generate a new candidate is {2,4,5}. This is because it is the only subset of {1,2,3,4,5} that has the same length and the same 2-prefix as the itemset {2,4,6}. By combining {2,4,5} and {2,4,6}, we recover the missing candidate {2,4,5,6}. No itemset needs to be restored for itemsets {2,5,6} and {4,5,6}. As stated earlier, our algorithm will not consider the subsets of a maximal frequent itemset as candidates. Therefore, the prune procedure in our new candidate generation algorithm will remove those subsets.
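A sketch of the recovery core in Python (mirroring the pseudocode's index convention with 0-based sorted tuples; the pred_supp/key bookkeeping is omitted here) reproduces this example:

```python
def recover(Lk, mfp, k):
    """For each frequent k-itemset l and each MFP element m containing
    l's (k-1)-prefix, combine l with every item of m lying past the
    position of l's (k-1)-th item (cf. the recovery procedure)."""
    Ck1 = set()
    for l in Lk:                            # l: sorted tuple of k items
        for m in mfp:                       # m: sorted tuple, |m| > k
            if len(m) <= k or not all(x in m for x in l[:k-1]):
                continue
            j = m.index(l[k-2])             # l's (k-1)-th item sits at m[j]
            for item in m[j+1:]:
                Ck1.add(tuple(sorted(set(l) | {item})))
    return Ck1

L3 = [(2, 4, 6), (2, 5, 6), (4, 5, 6)]
print(recover(L3, [(1, 2, 3, 4, 5)], 3))   # {(2, 4, 5, 6)}
```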

Algorithm
The new prune procedure

Input
Current MFP and Ck+1 generated from Pascal-Gen and recovery procedures above

Output
Final candidate set Ck+1

1. for all itemsets c in Ck+1
2. if c is a subset of any itemset in current MFP
3. delete c from Ck+1

We eliminate the candidates that are subsets of elements of the current MFP (line 3).
In summary, our candidate generation process contains the following three steps:

1. call the Pascal-Gen join procedure
2. call the recovery procedure if necessary
3. call the new prune procedure

The Fast MAXFP Algorithm
We now present our complete algorithm; it relies on the combined approach for determining the maximum frequent pattern.

Algorithm
The MAXFP Algorithm

Input
A database and a user-defined minimum support

Output
MFP which contains all maximal frequent itemsets

1. ∅.supp := 1; ∅.key := true
2. L0 := {∅}; k := 1
3. C1 := {{i} | i ∈ I}
4. MFCP := {{1, 2, ..., n}}
5. MFP := ∅
6. while Ck ≠ ∅ begin
7. if k = 1 then
8.   {read database and count supports for itemsets in Ck and MFCP
9.     MFP := MFP ∪ {frequent itemsets in MFCP}
10.     Lk := {frequent itemsets in Ck}\{subsets of itemsets in MFP }
11.     for all l ∈ L1 do l.pred_supp := 1; l.key := (l.supp ≠ 1)
12.     }
13.   else {read the database and count supports for the itemsets in Ck (call count_support_procedure) and in MFCP
14.       MFP := MFP ∪ {frequent itemsets in MFCP}
15.     Lk := {frequent itemsets in Ck (call frequent_pattern_procedure)} \ {subsets of itemsets in MFP}
16.   }
17. Sk := {infrequent itemsets in Ck}
18. call the Pascal-Gen to generate Ck+1
19. if any frequent itemset in Ck is a subset of an itemset in MFP
20. call the recovery procedure to recover candidates to Ck+1
21. call new prune procedure to prune candidates in Ck+1
22. call the MFCP-gen algorithm when Sk is not empty
23. k := k + 1
24. end
25. return MFP

The algorithm starts with the empty set, which always has a support of 1 and which is (by definition) a key pattern (lines 1 and 2). In line 8, frequent 1-patterns are determined; they are marked as key patterns unless their support is 1 (line 11). Pascal-Gen is called to compute the candidate patterns (line 18). MFCP is initialized to contain one itemset, which consists of all the database items. MFCP is updated whenever new infrequent itemsets are found (line 22). If an itemset in MFCP is found to be frequent, then its subsets will not participate in the subsequent support counting and candidate generation steps. Lines 10, 15 and 21 exclude those itemsets that are subsets of any itemset in the current MFP set, which contains the frequent itemsets found in the MFCP. If some itemsets in Lk are removed, the algorithm calls the recovery procedure to recover missing candidates (line 20). Line 13 calls count_support_procedure to count the supports of candidate itemsets (when k > 1). frequent_pattern_procedure is called (when k > 1) to find the frequent patterns among the newly generated candidate itemsets (line 15).

Algorithm
The count_support_procedure

Input
Ck, the candidate set generated from the Pascal-Gen and recovery procedures above, with each candidate c carrying the flag c.key.

Output
Candidate set Ck in which each key candidate c has its counted support c.supp.

1. if ∃ c ∈ Ck | c.key then
2. forall o ∈ D do begin
3. Co := subset(Ck, o)
4. forall c ∈ Co | c.key do
5. c.supp ++
6. end

Algorithm
The frequent_pattern_procedure

Input
Ck, the set of candidate k-patterns c, each with the flag c.key, the value c.pred_supp and the support c.supp.

Output
Lk, the set of frequent k-patterns p with their support p.supp and the p.key flag.

1. forall c ∈ Ck do
2. if c.supp ≥ minsup then begin
3. if c.key and c.supp = c.pred_supp then
4. c.key := false
5. Lk := Lk ∪{c}
6. end
7. end

Counting the Supports of All Frequent Itemsets
The MAXFP algorithm extracts all maximal frequent patterns from the database. The non-empty subsets of these maximal frequent patterns give all frequent patterns, but in those cases when we need to know the supports of all frequent patterns, previous work offers no efficient algorithm for counting the supports of all frequent itemsets.

Most of the previous algorithms (Lin and Kedem, 1998) require an extra (one more) database pass for counting the supports of all frequent itemsets that are generated by the maximum frequent patterns. They build a hash-tree for the maximum frequent patterns, which stores all the subsets of the maximal frequent patterns. For counting the supports of all frequent patterns they use the Apriori algorithm with a minor change and add a field for storing the supports of internal nodes. The support in every node is incremented whenever the node is visited in the forward direction.

Here we introduce a new concept for counting the supports of all frequent patterns. In our MAXFP approach, the Pascal-Gen algorithm uses the concept of key patterns; in this case each item has three fields: key, pred_supp and supp. The supp field is used for storing the support of itemsets, so there is no need to add an extra field as in Lin and Kedem (1998). In some cases there is no need to count the supports of all frequent itemsets at all, namely when all itemsets are generated during the extraction of the maximum frequent patterns. When this does not happen, we use Pascal-Gen with minor changes to count the supports of those patterns which did not occur during the extraction of the maximum frequent patterns. It counts the support of any frequent itemset by accessing the record of frequent itemsets which have been extracted by MAXFP, or the supports of its subsets (by the key pattern property). Using the concept of key patterns, the modified Pascal-Gen counts the supports of all frequent patterns without additional database access.
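The derivation just described can be sketched as a recursion (Python; the `infer_supp` helper and the stored-supports dict are hypothetical illustrations, assuming the supports of all frequent key patterns, including the empty set, were recorded while MAXFP ran):

```python
def infer_supp(pattern, stored):
    """Derive supp(P) without a database access: if P's support was
    stored (P is a key pattern counted by MAXFP), return it; otherwise
    P is a non-key pattern and Theorem 3 gives supp(P) as the minimum
    of its (|P|-1)-subsets' supports, computed recursively."""
    P = frozenset(pattern)
    if P in stored:
        return stored[P]
    return min(infer_supp(P - {p}, stored) for p in P)

# stored supports of the key patterns of the toy database
# [{1,2,5}, {2,3,5}, {1,2,3,5}, {2,5}] (illustrative, not Fig. 4)
stored = {frozenset(): 1.0, frozenset({1}): 0.5,
          frozenset({3}): 0.5, frozenset({1, 3}): 0.25}

print(infer_supp({2, 5}, stored))     # 1.0: {2,5} occurs in every transaction
print(infer_supp({1, 2, 5}, stored))  # 0.5
```

The recursion terminates because the support of the empty set is always stored; correctness for non-key patterns is exactly Theorem 3.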

Performance Evaluation
According to both Agrawal and Srikant (1994) and our experiments, a large fraction of the 2-itemsets will usually be infrequent. These infrequent itemsets cause MFCP to go down the levels very fast, allowing it to reach some maximal frequent itemsets after only a few passes. Indeed, in our experiments we have found that, in most cases, many of the maximal frequent itemsets are found in MFCP in very early passes.

Running Example
We illustrate the new algorithm on the dataset shown in Fig. 4, with minsup = 2/5:

To find the maximum frequent patterns in the above dataset, our new MAXFP algorithm needs three database passes, in which it counts the supports of 7+11+2 = 20 patterns. Pincer-search (Lin and Kedem, 1998) would have needed three database passes to count the supports of 7+16+15 = 38 patterns on the same dataset. If at any stage the MFCP is not updated, the algorithm reduces the number of database passes. Moreover, when the bottom-up direction runs entirely through Pascal-Gen, no additional database scan is required: all frequent patterns are generated while extracting the maximum frequent patterns, so there is no further need to count their supports.

Conclusions

The maximum frequent patterns provide a unique representation of all the frequent itemsets. In many situations it suffices to discover the maximum frequent patterns; once they are known, all the required frequent patterns can easily be generated.

In the present study, the fast MAXFP algorithm discovers the maximum frequent patterns and counts the supports of all frequent patterns very efficiently, without accessing the database. This new algorithm reduces both the number of database passes and the number of candidates considered. By using the MFCP, we are able to discover some maximal frequent itemsets in early passes. This early discovery of the maximal frequent itemsets reduces the number of candidates and the number of passes over the database, which in turn reduces CPU time and I/O time. The effect is especially significant when the maximal frequent itemsets discovered in the early passes are long. In the bottom-up direction we use the pattern counting inference approach, based on the notion of key patterns of equivalence classes of patterns. It allows us to count from the dataset the supports of only some frequent patterns, the frequent key patterns, rather than the supports of all frequent patterns.

For counting the supports of all frequent patterns, the method has the following advantages.

There is no need for an extra process to add a new field to each itemset.
It does not require additional database pass.
This algorithm can be extended with minor modifications to perform incremental mining of association rules.

REFERENCES

1:  Agrawal, R. and R. Srikant, 1994. Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago, pp: 478-499

2:  Agrawal, R. and R. Srikant, 1995. Mining sequential patterns. Proceedings of the 11th International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pp: 3-14

3:  Bayardo, R.J., 1998. Efficiently mining long patterns from databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 1-4, 1998, New York, USA., pp: 85-93

4:  Brin, S., R. Motwani and C. Silverstein, 1997. Beyond market baskets: Generalizing association rules to correlation. Proceedings of the ACM SIGMOD International Conference on Management of Data, May 13-15, Tuscon, AZ., pp: 265-276

5:  Bastide, Y., R. Taouil, N. Pasquier, G. Stumme and L. Lakhal, 2000. Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newslett, 2: 66-75.

6:  Han, J., J. Pei and Y. Yin, 2000. Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, May 15-18, 2000, Dallas, TX., USA., pp: 1-12

7:  Kamber, M., J. Han and J.Y. Chiang, 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, August 14-17, 1997, Newport Beach, CA., pp: 207-210

8:  Lent, B., A. Swami and J. Widom, 1997. Clustering association rules. Proceedings of the 13th International Conference on Data Engineering, April 7-11, 1997, Birmingham, UK., pp: 220-231

9:  Mannila, H., H. Toivonen and A.I. Verkamo, 1997. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1: 259-289.

10:  Pasquier, N., Y. Bastide, R. Taouil and L. Lakhal, 1999. Efficient mining of association rules using closed itemset lattices. Inform. Syst., 24: 25-46.

11:  Pasquier, N., Y. Bastide, R. Taouil and L. Lakhal, 1999. Mining frequent closed itemsets for association rules. Proc. ICDT Conf., pp: 398-416.

12:  Pei, J., J. Han and R. Mao, 2000. CLOSET: An efficient algorithm for mining frequent closed itemsets. Proceedings of the ACM SIGMOD International Conference on Management of Data, May 15-18, 2000, Dallas, TX., USA., pp: 11-20

13:  Silverstein, C., S. Brin and R. Motwani, 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Min. Knowl. Discovery, 1: 39-68.

14:  Zaki, M.J., S. Parthasarathy, M. Ogihara and W. Li, 1997. New algorithms for fast discovery of association rules. Proc. KDD Conf., pp: 283-286.

15:  Lin, D. and Z.M. Kedem, 1998. Pincer-Search: A new algorithm for discovering the maximum frequent set. Proceedings of the 6th International Conference on Extending Database Technology, March 23-27, 1998, Valencia, Spain, pp: 105-119

16:  Mannila, H., H. Toivonen and A.I. Verkamo, 1994. Improved methods for finding association rules. Proceedings of AAAI 1994 Workshop on Knowledge Discovery in Databases, July 31-August 4, 1994, Seattle, WA., USA., pp: 181-192
