Association mining can be used to identify items that are related to other items. A concrete everyday example is the analysis of shopping baskets in an online shop or a supermarket. Because each customer purchases a different combination of products at different times, several questions can be addressed with association rule mining:
- Why do customers buy certain products?
- Who are the customers? (Students, families,...)
- Which products might benefit from advertising?
The calculated rules can be used to restructure the market or to send special promotions to certain types of customers. This chapter discusses the Apriori and the FP-Growth algorithm. Both algorithms often provide the basis for further specialized algorithms. A common strategy used by association rule mining algorithms is to split the problem into two tasks:
- Generation of frequent item sets:
The task is to find item sets that satisfy a minimum support threshold that is usually defined by the user.
- Generation of rules:
The task is to derive rules that satisfy a minimum confidence from the frequent item sets.
The Apriori algorithm (see Chapter IV-A2) and the FP-Growth algorithm (see Chapter IV-A3) are both based on this principle.
The following terminology has been adapted from the literature in order to discuss the algorithms of this chapter more accurately. Let $S$ be a stream or a database with $n$ elements. Then $I = \{i_1, i_2, \dots, i_m\}$ is an item set with $m$ items; an item set could for instance be a basket in a supermarket. Furthermore, an association rule is of the form $A \implies B$ where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$. So rule mining can extract a trend such as "when stocks of credit institutes go down, companies for banking equipment follow".
There is no restriction on the number of items that may appear in such a rule. The support of an item set is the percentage of records that contain this item set. Let $\sigma(A)$ denote the frequency of occurrence of an item set $A$. Then the support of an item set $A$ is:

$$supp(A) = \frac{\sigma(A)}{n}$$
The confidence is calculated as the relative percentage of records that contain both $A$ and $B$. It determines how frequently items in $B$ appear in transactions that contain $A$:

$$conf(A \implies B) = \frac{supp(A \cup B)}{supp(A)}$$
Additionally, the frequency $\sigma(I)$ of an item set $I$ is defined as the number of transactions in $S$ that contain $I$. $L_k$ denotes the set of frequent item sets of length $k$.
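The definitions of support and confidence above can be sketched in Python. The transaction data below is a hypothetical mini-database introduced only for illustration:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = supp(A | B) / supp(A)."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)

# hypothetical mini-database with three transactions
S = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["b", "c"])]
print(support(["a", "b"], S))        # 2 of 3 transactions -> 0.666...
print(confidence(["a"], ["b"], S))   # every transaction with "a" also has "b" -> 1.0
```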
2. Apriori Algorithm
This section discusses the Apriori algorithm, one of the most common algorithms for association rule mining. The Apriori algorithm uses prior knowledge to mine frequent item sets in a database or stream. It is a level-wise iterative search, where the frequent item sets of length k are used to derive the candidate item sets of length k + 1. The algorithm is based on a simple principle:
"If an item set is frequent, then all subsets of this item set are also frequent."
To use the algorithm, the user specifies the minimum support (min_sup), the data source (S) and the minimum confidence (min_conf) (See Chapter IV-A1 for terminology). The first task is to extract frequent item sets.
Extract frequent item sets
Before the iteration starts, the algorithm scans the database once to count the occurrences of each individual item; the items that satisfy the min_sup input form the first set of frequent item sets. Figure 4 shows the pseudocode for the implementation of the Apriori algorithm. Figure 3 shows the corresponding UML (Unified Modeling Language) activity diagram.
The algorithm outputs a set of item sets $I$ with $supp(I) \geq min\_sup$.
The function aprioriGen in line 4 is the part where the set of candidates $C_k$ is created. It takes $L_{k-1}$ as an argument. This function is performed in two steps:
- Join: generation of a new candidate item set $C_k$ from $L_{k-1}$.
- Prune: elimination of candidate item sets that contain a $(k-1)$-subset which is not in $L_{k-1}$.
For this, it is assumed that the items within each item set are ordered, e.g. alphabetically.
Assume that $p$ and $q$ are distinct item sets in $L_{k-1}$. The join step then joins all $(k-1)$-item sets that differ only in the last element. The following SQL statement shows a way to select those items:

insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1 and ... and p.item_{k-2} = q.item_{k-2} and p.item_{k-1} < q.item_{k-1}
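The join and prune steps of the candidate generation can be sketched in Python. This is a minimal sketch of the two steps, not the pseudocode of Figure 4; the function name and input sets are chosen for illustration:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_prev.

    Join: merge two sorted (k-1)-itemsets that agree everywhere except in
    the last position.  Prune: drop any candidate that has a (k-1)-subset
    not contained in L_prev.
    """
    L_prev = {tuple(sorted(s)) for s in L_prev}
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:              # join step
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(s in L_prev for s in combinations(c, k - 1)):
                    candidates.add(frozenset(c))
    return candidates

# frequent 2-itemsets -> candidate 3-itemsets
L2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori_gen(L2, 3))   # {frozenset({'a', 'b', 'c'})}
```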
Lines 8 to 13 (Figure 4) perform the support counting. This is done by comparing each transaction with the candidates in $C_k$, which is a very expensive task, especially if the number of transactions and candidates is large. The counted candidates are then used to proceed with the prune step.
In the prune step, the set $L_k$ is generated from the candidates in $C_k$ by pruning those item sets that cannot satisfy the min_sup value (line 14).
After the for-loop in line 3 terminates, the algorithm outputs the set of all item sets whose support is greater than or equal to min_sup.
With this it is possible to generate association rules.
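The extraction of frequent item sets described above can be sketched end to end in Python. This is a simplified version for illustration: the join here takes unions of frequent $(k-1)$-item sets and relies on support counting alone, without the separate subset-based prune step:

```python
from collections import defaultdict

def apriori(transactions, min_sup):
    """Return all item sets whose support (relative frequency) >= min_sup."""
    n = len(transactions)
    # first pass: count single items to build L1
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s for s, c in counts.items() if c / n >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # join: unions of two frequent (k-1)-item sets that have exactly k items
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        counts = defaultdict(int)
        for t in transactions:              # support counting pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent |= L
        k += 1
    return frequent

S = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["b", "c"])]
print(sorted(sorted(s) for s in apriori(S, min_sup=2/3)))
# [['a'], ['a', 'b'], ['b'], ['b', 'c'], ['c']]
```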
Generation of association rules
The generation of the association rules is similar to the generation of the frequent item sets and is based on the same principle. Figure 5 and Figure 6 show the pseudocode for the rule generation part of the Apriori algorithm.
First, all rules that satisfy the minimum confidence are extracted from the frequent item sets. These rules are then used to generate new candidate rules. For example, if $\{A, B, C\} \implies D$ and $\{A, B, D\} \implies C$ are rules that fit the minimum confidence, then there will be a candidate rule $\{A, B\} \implies \{C, D\}$. This is done by merging the consequents of both rules. The new rule has to satisfy the min_conf value, which was specified by the user.
The generation of such rules is done iteratively. The function aprioriGen (Figure 6, line 3) is used again to generate candidates for new rules. Then the set of candidates is checked against the min_conf value (Figure 6, line 6). For this, a property analogous to the one used in the generation of the frequent item sets holds: if a rule does not satisfy the minimum confidence, then no rule with a larger consequent derived from it can satisfy it either. Because of this, the rule can be removed (Figure 6, line 9).
Example - Association rule mining
Usually the algorithms for association rule mining are used for basket analysis in online stores. To give an example, assume the following transactions:
Assume the rule $Bread \implies Butter$. To compute the confidence for this rule, the support of the item set $\{Bread, Butter\}$ is needed:
Then the confidence for this rule is computed as follows:

$$conf(Bread \implies Butter) = \frac{supp(\{Bread, Butter\})}{supp(\{Bread\})}$$
So the rule $Bread \implies Butter$ has a confidence of 100%: if there is bread in a basket, the probability that the same basket also contains butter is about 100%.
Now assume the rule $Butter \implies Milk$. Again the support value has to be calculated:
The calculation of the confidence is done as above:

$$conf(Butter \implies Milk) = \frac{supp(\{Butter, Milk\})}{supp(\{Butter\})}$$
So the rule $Butter \implies Milk$ has a confidence of about 66%: if there is butter in a basket, the probability that the same basket also contains milk is about 66%.
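The two confidence values above can be checked in Python. The transaction table below is a hypothetical one chosen to be consistent with the stated results (the original table is not reproduced here):

```python
# hypothetical transactions, consistent with the confidences in the text
S = [
    frozenset({"Bread", "Butter", "Milk"}),
    frozenset({"Bread", "Butter"}),
    frozenset({"Butter", "Milk"}),
]

def confidence(antecedent, consequent):
    """conf(A => B): fraction of A-transactions that also contain B."""
    both = sum(1 for t in S if antecedent | consequent <= t)
    only_a = sum(1 for t in S if antecedent <= t)
    return both / only_a

print(confidence(frozenset({"Bread"}), frozenset({"Butter"})))  # 1.0   (100%)
print(confidence(frozenset({"Butter"}), frozenset({"Milk"})))   # 0.666... (about 66%)
```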