INTRODUCTION
Online shopping has become the primary activity in web. Over 85% of the world's
online population have used the internet to make a purchase and more than half
of internet users are regular online shoppers who make online purchases at least
once a month (Nielsen Company, 2008). Not like shopping
in supermarkets, consumers can easily find what they want in the capacious room
for display of goods. In Internet mall, goods can be only presented in web pages,
which can not present more goods for consumers at the same time. Though searching
functions are provided in most of web sites, it is not still convenient to find
what consumers really need. Efficient commodity organization and recommendation
can increase sales (Wolf, 2007; Lee
and Kwon, 2008). It is necessary for webmasters to let consumers see the
commodities that they really want easily, which means that the webmasters must
present the salable goods for most of consumers on the prominent position in
their websites. Obviously, what are the salable goods is the primary problem.
Some webmasters often place the promotional or discount products in the front
page, but maybe they are not what the consumers really want. In this study,
a strategy based on frequent pattern mining is proposed to find the most welcome
product combination for most of consumers, which is applied on the consumers'
browsing records.
Frequent pattern mining is a classical problem in data mining, many researchers
give out their solutions for different data environments. Agrawal
and Srikant (1995) firstly introduced frequent sequential patterns in large
databases of customer transactions and in Srikant and Agrawal
(1996), they generalized the problem with time constraints and relaxed the
restriction on the consistency of the elements' sources. The advised methods
use the strategy of candidate set generation-and-test. Han
et al. (2004) proposed a novel FP-growth mining method, which adapts
a FP-tree structure and can compress a large database into a smaller FP-tree
structure to reduce the database scans and the number of candidate sets. Pei
et al. (2004) and Han et al. (2000),
are developed FreeSpan and PrefixSpan, both of them are projection-based, sequential
pattern-growth approach, which can reduce the generation of candidate subsequence
and are fit for more larger sequence databases.
The online shopping environment has its own characteristics, which is a little different from classical sequential database in the strict sense. We can consider its data as two-valued. Every sequence is possible to be long because of varieties of commodities and frequent browsing patterns are often short because of the fact that a consumer is impossible to be interested in large numbers of goods at one time. It is important to examine the frequent pattern problem in this situation. In this study, we propose a method based on frequent dependent relationship among commodities, the dependent relationship is computed according to a frequent similar function and a given support threshold. Through the dependent relationship, we can find all containing classes which can help us to narrow the search space.
PROBLEM DEFINITION
The problem of frequent browsing pattern mining in online shopping environment is detailed and some formal definitions are made for the problem. A concrete example is accompanied to understand these definitions.
A two-dimension matrix is used to denote users' browsing data, whose rows represent users and columns represent commodities, which will be also called as items. Every row can be called user's browsing vector, every column can be called commodity's browsed vector. The whole matrix is called as basic fact table (abbr. FT). Every field in the table is two-valued, 0 or 1. If the ith user browsed the jth commodity, FT[i][j]=1, otherwise 0.
A commodity pattern is a combination of some commodities, which is a subset of the whole commodity set. A minimum support value is given, remembered as min_support, which is a threshold for deciding whether a commodity combination pattern is frequent. A pattern is frequent iff all commodities in the pattern appear in at least min_support users' browsing vectors synchronously.
In order to compare the frequent browsed similarity among commodities. A frequent similar function is introduced, remembered as simFrequent. This function returns an integer.
Given two different commodities ci an cj, whose corresponding
browsed column vectors are Cvci and Cvcj and both of them
are n-dimensional. ci and cj will appear in some frequent
subpatterns iff simFrequent (Cvci, Cvcj)≥min_support.
simFrequent can also take more than two parameters or only one parameter. For
example:
Obviously, simFrequent is monotone. Given two sets of items, S1
and S2, S1⊆S2. If simFrequent (S2)≥min_support,
simFrequent (S1)≥min_support, is also right. If simFrequent (S1)≥min_support,
simFrequent (S1)≥min_support, too. We will utilize the characteristic
to optimize our computing later. We make several related definitions as follows,
Definition 1: Frequent dependent relationship given a commodity X, for every element Y in the commodity set, if simFrequent(X,Y)≥min_support, we say frequent dependent relationship exists between X and Y, remembered as X→Y.
| Table 1: |
Basic fact table of browsing log |
 |
Definition 2: If X→Y and ∀i, X[i]≥Y[i], we say that X
decides Y, remembered as x⇒Y.
Obviously, X→Y, iff Y→X. → is symmetrical, but not transitive
and ⇒ is transitive, not symmetrical. If there are X→Y and Y→Z,
we can not get X→Z. If there are X→Y, Y→Z and X→Z, we still
cannot conclude X→Y∪Z. But to operator ⇒, the results are reverse.
Here,
.
Definition 3: Containing class given a commodity X, for every element Y in the commodity set, if X→Y, then Y is an element of X's containing class, remembered as Contain(X).
If X⇒Y, Y is an element of X's containing class, but X does not belong to Contain(Y) and it is trivial to compute Contain(Y) at this situation, because any element that has dependent relationship with Y must have dependent relationship with X. All commodities that possibly appear in the same frequent pattern with X are in X's containing class.
Example 1: Let, Table 1 be the basic fact table. Suppose min_support = 2, a1→a2, a1→a3, a1→c1, a1→c2 are right in the table, then we have Contain(a1) = {a2,a3,c1,c2}. It means that the biggest frequent pattern containing a1 is possible to be {a1,a2,a3,c1,c2}. Equally, Contain(a2) {a1,a3,b2,c1,c2}, a2 = Contain(b2) because of a2⇒b2.
Definition 4: Frequent browsing pattern given a containing classes C of X, if simFrequent (X,Y)≥min_support, ∀Y, Y⊆C, we say X∪Y is a frequent browsing pattern. Considering Contain(a1), the frequent patterns are {a1,a2,c1}, {a1,a3}, {a1,c1,c2}.
BROWSING PATTERN MINING WITH FREQUENT DEPENDENCY
Here, we propose an algorithm to find all maximum frequent browsing patterns from the basic fact table considering the given support threshold.
Data cleaning and preparation: Users' browsing data often exists much noise, which are not users' real intention about shopping and should be cleaned. For example, if a consumer wants to buy something from some website, his (her) interests are often concentrated on some goods in one or several categories. If one consumer browses various goods or categories in a short time, obviously his (her) browsing is not for buying, or does not contain a clearing buying objectives, such records should be cleaned. Here we should define a threshold which can help to delete the records efficiently and it will be set empirically. Certainly, if any dependency does not hold among the items in the table, the deep mining is no meaning. When we remove some illegal data, we should find all single frequent items according to the simFrequent function. The frequent item, ci, should satisfy:
The set including all single frequent items is marked as:
Multiple candidate sets method: All frequent browsing patterns are composed
of the items in FreItemSet, we can use an exhaustive method to find them, but
it has a very low efficiency because of massive computing. The computing is
necessary only for items that have dependent relationship. The basic strategy
by our algorithm is to find the containing classes based on dependent relationship
of every element in the commodity set, which can efficiently reduce the searching
space and degrade the computing complexity and then use an iterative process
to compute the frequent patterns in every containing class. As we have seen
in the above section, transitivity does not hold for operation →. For example,
if we want to know whether {a1, a2 ,a3} is a frequent pattern, we should have
simFrequent(a1,a2,a3)≥min_support, not just simFrequent(a1,a2)≥min_support
and simFrequent (a2,a3)≥min_support.
Firstly, we should compute the dependent relationships among the items in FreItemSet
with the simFrequent function, the result is a set composed of all dependent
pairs, remembered as dependSet. It is possible for some single frequent items
to have not any dependent relationship with other frequent items. We should
remove all such single frequent items. The rest dependent pairs are grouped
by their left part, for every group, we obtain a corresponding containing class.
For example, {X→Y, X⇒Z, X→R, X→S} is one group of the result,
then Contain{X} = {Y,Z,R,S} and it is impossible for other items to be included
by Contain(X), which means that no other items except the ones in Contain(X)
can co-occurrence in the final frequent browsing patterns with X. We will compute
the frequent browsing patterns according to the above containing classes. Here,
we utilize the monotone of simFrequent and the similar thinking of Hasse Diagram
(Sharma et al., 2007) to optimize computing.
We give a Hasse diagram in Fig. 1, suppose {a, b, c, d} is
the current containing class for x. When we find that {x,a,b,c} is a frequent
pattern, then {x,a,b}, {x,a,c}, {x,b,c} must be under monotone and extra computing
can be avoided.
| Algorithm 1: |
Multiple candidate sets browsing pattern mining based on frequent
dependency |
 |
|
| Fig. 1: |
Lattice for dependent relationship |
If {x,b,d} is not a frequent pattern, the computation for {x,a,b,d} and {x,b,c,d}
is also redundant. Here, we can do further optimization based on computation
path. For example, {x,a,b,c} is not a frequent pattern, which means that at
least one of its subsets is not frequent, suppose it is {x,a,b}, according to
the monotone of simFrequent, the other supersets of {x,a,b}, here is {x,a,b,d},
must not be. If we find {x,a,c} is frequent, the next better choice should be
{x,b,c,d}, not {x,a,c,d}. The computing complexity for {x,b,c,d} and {x,a,c,d}
is same, but {x,a,c}∩{x,b,c,d}⊆ {x,a,c}∩{x,a,c,d}, when {x,b,c,d}
is frequent, we can remove more elements. The details are elaborated in Algorithm
1.
In the proposed method, we divide the only candidate set composed by all single frequent items into multiple candidate sets through the computing of dependent relationship, which can reduce the search space efficiently. In every sub-search space, we utilize the monotone of simFrequent and the Hasse diagram structure to prune the search space and find an efficient computation path.
RESULTS
Here, we evaluate multi-candidate set browsing pattern mining based on frequent dependent relationship (MCSFD) on our datasets and show how MCSFD outperforms single frequent item candidate set (SFICS) in runtime and the later regards all single frequent items as a candidate set without considering the dependent relationships. Both SFICS and MCSFD are implemented in Java. All experiments have been performed on Windows environment with a Core Duo 2.2 GHz CPU and 2 GB main memory. The datasets are simulated according to the real online shopping scenarios, which are represented by matrixes and summarized in Table 2.
The name of every dataset is formatted as AAA_BBB_CCC, in which AAA represents the number of consumers, BBB the number of commodities and CCC the minimum support threshold. For our evaluation, we compare MCSFD against SFICS according to the execution time on different datasets. For every dataset, we execute both SFICS and MCSFD for three times and get the average time for the experiment results (Table 3).
In Fig. 2, we fixed the number of commodities to 1000, let
the number of consumers varies from 10000 to 100000 and then compare the runtime
between SFICS and MCSFD. With the different number of consumers, the value of
frequent support degree is decided empirically. The number of both frequent
items and frequent dependency increase when the number of consumers changes
from 10000 to 50000.
| Table 2: |
Datasets summarization |
 |
| Table 3: |
Experiment execution time (sec) |
 |
|
| Fig. 2: |
Commodity no. = 1000 |
But both of them in 100000_1000_4550 decrease, which is because of a bigger
threshold for the frequent support value. Every commodity is welcome by some
certain number of people, when the number of consumers reaches a certain value,
its increase contributes little to the support degree of commodities. To every
dataset, MCSFD need less time than SFICS. In the second group of experiments
(Fig. 3), we fixed the number of consumers on 50000, the corresponding
support degree is pegged at 2500 and let the number of commodities rises from
500 to 2000. In every experiment, MCSFD has better performance than SFICS. When
the number of commodities is 2000, MCSFD needs only 313 sec, while SFICS, 1139.3
sec, is 2.64 times larger than MCSFD. The experiments show that MCSFD is more
effective especially for large number of commodities.
|
| Fig. 3: |
Consumer No. = 50000 |
|
| Fig. 4: |
CommodityNo. 1000 and support = 3000 |
Another group of experiments are carried on a group of datasets in which the number of commodities and single frequent items are fixed, 1000 and 3000. The runtime by both SFICS and MCSFD is linear with the number of consumers (Fig. 4). The experiments shows that our method is more fit in the environment where more single frequent items and more frequent subpatterns exist.
CONCLUSIONS
In this study, we examined the problem of effective commodity presentation
on online shopping websites and have proposed an approach based on frequent
dependent relationship to mine frequent browsing patterns from the users' browsing
log. Through introducing the frequent dependent relationship among commodities,
we can construct containing classes that are just the candidate sets for frequent
pattern mining, the process narrows the searching space and reduces the computing
complexity. The experimental results on our datasets show our method's efficiency
for finding all frequent browsing patterns and it can be applied on more large
datasets. Based on the present study, we will step further to find the optimal
browsing patterns for every kind of users and provide more effective recommendation
for specific types of users.
ACKNOWLEDGMENTS
This study is supported by National Natural Science Foundation of China (NSFC No. 60961002). Authors would like to thank Ping Zhou for helpful discussions and reviewing the draft of the study and her advice improved the quality of the study. Jingwei Zhang devoted himself to designing and carrying out some experiments.