Frequent Pattern (FP)-Growth algorithm

The Frequent Pattern (FP)-Growth method is used with databases and not with streams. The Apriori algorithm needs n+1 scans if a database is used, where n is the length of the longest pattern. By using the FP-Growth method, the number of scans of the entire database can be reduced to two.
The algorithm extracts frequent item sets that can be used to extract association rules. This is done using the support of an item set. The terminology, that is used for this algorithm is described in chapter IV-A1.

The main idea of the algorithm is to use a divide and conquer strategy:
Compress the database which provides the frequent sets; then divide this compressed database into a set of conditional databases, each associated with a frequent set and apply data mining on each database.

To compress the data source, a special data structure called the FP-Tree is needed [26]. The tree is used for the data mining part.
Finally the algorithm works in two steps:

  1. Construction of the FP-Tree
  2. Extract frequent item sets

Construction of the FP-Tree

The FP-Tree is a compressed representation of the input. While reading the data source each transaction t is mapped to a path in the FP-Tree. As different transaction can have several items in common, their path may overlap. With this it is possible to compress the structure.

Figure 7 shows an example for the generation of an FP-tree using 10 transactions.
contruction_fp_tree

The FP-Tree is generated in a simple way.
First a transaction t is read from the database. The algorithm checks whether the prefix of t maps to a path in the FP-Tree. If this is the case the support count of the corresponding nodes in the tree are incremented. If there is no overlapped path, new nodes are created with a support count of 1. Figure 8 shows the corresponding activity diagram using an UML (Unified Modeling Language) activity diagram.

activity-fp-construction

Additional a FP-Tree uses pointers connecting between nodes that have the same items creating a singly linked list.
These pointers, represented as dashed lines in Figure 7, 9 and 10 are used to access individual items in the tree even faster.

The corresponding FP-Tree is used to extract frequent item sets directly from this structure. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path.
In the best case scenario there is only a single node, because all transactions have the same set of items. A worst case scenario would be a data source where every transaction has a unique set of items. Usually the FP-tree is smaller than the uncompressed one, because many transactions share items.
As already mentioned the algorithm has to scan the data source twice.

  • Pass 1:
    The data set is scanned to determine the support of each item. The infrequent items are discarded and not used in the FP-Tree. All frequent items are ordered based on their support.
  • Pass 2:The algorithm does the second pass over the data to construct the FP-tree.

The following example shows how the algorithm works.
According to Figure 7 the first transaction is {a,b}.

  • Because the tree is empty, two nodes a and b with counter 1 are created and the path is created.
  • After {b, c, d} was read, three new nodes b, c and d have to be created. The value for count is 1 and a new path is created.  Because the value b was already in transaction one, there is a new pointer between the b's (dashed lines).
  • The transaction {a, c, d, e} overlaps with transaction one, because of the a in the first place. The frequency count for a will be incremented by 1. Additional pointers between the c's and d's are added.

After each transaction was scanned, a full FP-Tree is created (Figure 7-iv). Now the FP-Growth algorithm uses the tree to extract frequent item sets.

Extract frequent item sets

A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer strategy. Because every transaction is mapped on a path in the FP-Tree, it is possible to mine frequent item sets ending in a particular item, for example e or d. So according to Figure 9, the algorithm first searches for frequent item sets ending with e and then with d, c, b and a until the root is reached. Using the pointers, each the paths can be accessed very efficient by following the list. Furthermore each path of the tree can be processed recursively to extract the frequent item sets, so the problem can be divided into smaller subproblems. All solutions are merged at the end. This strategy allows to execute the algorithm parallel on multiple machines [28].

fp-tree-subproblems

The FP-Growth algorithm finds all item sets ending with a specified suffix using the divide and conquer strategy. Assume the algorithm analyzes item sets ending with e. To do so, first the item set e has to be frequent. This can be done using the corresponding FP-Tree ending in e (Figure 10 (a)). If it is frequent, the algorithm has to solve the subproblem of finding frequent item sets ending in de, ce, be and ae. These subproblems are solved using the conditional FP-Tree (Figure 10 (b)).

The following example (Figure 10) shows how the algorithm solves the subproblems with the task of finding frequent item sets ending with e [26]. Assume the minimum support is set to two.

  1.  The first step is to collect the initial prefix path (Figure 10 a)
  2. From this prefix path the support count is calculated by adding all support counts with node e. In the example the support count is 3.
  3. Because the algorithm has to solve the subproblems of finding frequent item sets ending with de, be, ce and ae. To solve the subproblems the prefix path has to be converted into a conditional FP-Tree ( Figure 10 b). This tree is used to find frequent item sets ending with a specific suffix.
    1. Update the support count along the prefix path that don't contain e. Consider the path far right of the tree: . This path includes the transaction {b,c} that doesn't contain the item e. Because of this the count along this prefix path has to be 1.
    2. The node e has to be removed, because the support counts have been updated to reflect only transactions that contains e. The subproblems of finding item sets ending in dece, be and ae no longer need information about the node e.
    3. Because the support counts were updated, there might be some prefix paths that are no longer frequent. According to Figure 10 (a) the node b appears only once with support equal to 1. It follows that there is only one transaction that contains b and e. So be is not frequent and can be ignored.
  4.  The tree 10 b is used to solve the subproblems of finding frequent item sets ending with de, ce and ae. \newline
    Consider the subproblem of finding frequent item sets ending with de. A new prefix path tree is needed (Figure 10 c). After the frequency counts for d were updated, the support count for {d,e} is equal to 2 and it fits with the defined conditions. Next the conditional FP-tree for de is constructed using the method as from step 3. Figure 10 d shows the conditional FP-tree for de. This tree contains only one item a. The support is 2 which fits the conditions. The algorithm extracts the item set {a,d,e} and this subproblem is completely processed.
  5. The algorithm starts with the next subproblem.

fp-growth-example

This example demonstrates that the runtime depends on the compression of the data set.
The FP-Growth algorithm has some advantages compared to the Apriori algorithm:

  1. Compressed data structure.
  2. No expensive computation of candidates.
  3. Using a divide and conquer strategy.
  4. Scanning the data source only twice.

It can be assumed that the FP-Growth algorithm is more efficient.

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.

Time limit is exhausted. Please reload the CAPTCHA.