Generate Association Rules (Market Basket Analysis)

DESCRIPTION:

Generate association rules from a dataset. This is often known as "market basket analysis" when it is used to analyze data on customer grocery shopping trips, where a customer buys a set ("basket") of items on each trip. The goal is to discover rules relating different items, such as "customers that buy chips often also buy dip," which can be used for making product placement and pricing decisions.

This function requires the bigdata library section to be loaded.

USAGE:

bd.assoc.rules(data,
               input.format="item.list",
               item.columns=NULL,
               id.columns=character(0),
               id.sort=T,
               prescan.items=F,
               init.items=NULL,
               min.support=0.1, min.confidence=0.8,
               min.rule.items=2, max.rule.items=5,
               output.rule.strings=T, output.rule.items=F,
               output.rule.sizes=F, output.measures=T,
               output.counts=F,
               rule.support.both=T,
               sort.output=c("lift", "rule"))

REQUIRED ARGUMENTS:

data
The input data to be analyzed. It can be a data.frame or bdFrame. The input.format argument specifies how transactions are read from this object.

OPTIONAL ARGUMENTS:

input.format
This specifies how transaction items are represented in the data object. It must be one of the four strings "item.list", "column.value", "column.flag", or "transaction.id". The four formats are described below.
item.columns
The names or numbers of the data columns containing items. If NULL, all columns in data are items.
id.columns
The names or numbers of the data columns identifying the transaction in the "transaction.id" input format.
id.sort
If this argument is true, the input data will be sorted by the "id.columns" columns before processing the "transaction.id" input format. If the input data is already sorted, this argument can be set to false to avoid sorting the data.
prescan.items
If this argument is true, the transactions are first scanned and the initial item list is created out of memory by a separate function call. Otherwise, bd.assoc.rules will construct a table of all unique items that appear in the input transactions, even if most of the items will not appear in rules (because they do not appear in enough transactions). If the input data contains many different items, such as thousands of SKUs for retail data, this table may exhaust memory and the call will fail with an error. Setting prescan.items to true can avoid possible memory problems, at the cost of additional runtime.
init.items
If this argument is given, it should be a vector of item strings. When reading transactions, bd.assoc.rules will ignore any items not on this list. You can produce this vector by first getting raw item counts and then selecting the items of interest.
min.support
The minimum support for items and rules, as a fraction (from 0.0 to 1.0) of the total number of input transactions. Note that the definition of rule support is affected by the value of the rule.support.both argument.
min.confidence
The minimum confidence for generated rules, as a fraction (from 0.0 to 1.0). The confidence of a rule is the fraction of the transactions containing the rule antecedents that also contain the consequent.
min.rule.items
The minimum number of items in generated rules. This counts both the consequent item and any antecedent items, so the default value of 2 will produce rules with a single consequent and at least one antecedent. If min.rule.items is set to 1, bd.assoc.rules may produce rules with one consequent and zero antecedents. These are meaningful rules, indicating that the consequent item appears in a significant number of transactions, regardless of what items appear with it.
max.rule.items
The maximum number of items in generated rules. This counts both the consequent item and any antecedent items, so the default value of 5 will produce rules with one consequent and up to four antecedents.
output.rule.strings
If this is true, the output data includes a column named "rule" containing the generated rule, formatted as a string. The example rule string "aa <- bb cc" is a rule with a single consequent item "aa" and two antecedent items "bb" and "cc". The antecedent items are always sorted alphabetically within a rule.
output.rule.items
If this is true, the output data includes a column named "con1" containing the generated rule consequent, and columns "ant1", "ant2", etc., containing the rule antecedents. If a given rule has only one antecedent, columns "ant2" and so on are empty strings. Using these columns, it is possible to process the rule items without parsing the rule strings.
output.rule.sizes
If this is true, the output data includes several columns with values measuring the number of items in each generated rule: "conSize" is the number of consequent items in the rule (currently always 1), "antSize" is the number of antecedent items in the rule, and "ruleSize" is the total number of items in the rule.
output.measures
If this is true, the output data includes the columns "support", "confidence", and "lift" with calculated numeric measures for each generated rule. These values are described below. Note that the definition of rule support is affected by the value of the rule.support.both argument.
output.counts
If this is true, the output data includes columns giving raw counts for each generated rule. These values can be used to calculate measures such as "support", as well as more complicated measures for the rules. The raw count columns are: "conCount" is the number of input transactions containing the rule consequent, "antCount" is the number containing the antecedents, "ruleCount" is the number containing both consequent and antecedents, "transCount" is the total number of transactions in the input set, and "itemCount" is the number of items used for creating rules. The "transCount" and "itemCount" values are the same for every rule.
rule.support.both
If this is true, the support of a rule is calculated as the number of transactions where all items in the rule occur (both the antecedents and the consequent), divided by the total number of transactions. If this is false, an alternate definition of rule support is used, where the support of a rule is calculated as the number of transactions where the antecedents appear, divided by the total number of transactions. This value affects how the min.support argument is interpreted, as well as the "support" values output when output.measures is true.
sort.output
A vector of output column names used to sort the result. The result data is sorted by each of these columns, in alphabetical order (for string columns) or descending order (for numeric columns). The default, c("lift", "rule"), will sort rules with the highest lift values first, and sort rules with the same lift value in alphabetical order. If this includes column names that are not in the result, these column names will be ignored.

VALUE:

A bdFrame or data.frame of the same type as the input "data" argument, containing one generated rule per row. The arguments output.rule.strings, output.rule.items, etc., specify which columns appear in each output row.

DETAILS:

bd.assoc.rules takes a dataset of "transactions", each containing a set of "items", and generates a set of association rules indicating which items appear with (are "associated with") which other items.

The significance of each generated rule is indicated by the various measures output when output.measures is true. The confidence and lift measure values can be calculated from the output.counts values, using the following equations:

confidence = ruleCount / antCount
lift = (ruleCount / antCount) / (conCount / transCount)

If the rule.support.both argument is true (the default value), the support measure is calculated using the following equation:

support = ruleCount / transCount

If the rule.support.both argument is false, the support measure is calculated using the following equation:

support = antCount / transCount

The count values output when output.counts is true can also be used to compute other measures.
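
As a worked illustration of the equations above, the following sketch computes each measure from a set of counts. The counts are hand-tallied for a hypothetical rule "C <- B" over five made-up transactions; they are not output from bd.assoc.rules. The code uses only base arithmetic and is assumed to behave the same way in R.

```r
# Hand-counted values for the hypothetical rule "C <- B" over the five
# transactions {A,C}, {A,C}, {B,C}, {B,C}, {B,D}.
conCount   <- 4   # transactions containing the consequent C
antCount   <- 3   # transactions containing the antecedent B
ruleCount  <- 2   # transactions containing both B and C
transCount <- 5   # total number of transactions

confidence <- ruleCount / antCount
lift <- (ruleCount / antCount) / (conCount / transCount)

# The two definitions of support, selected by rule.support.both.
support.both <- ruleCount / transCount   # rule.support.both=T
support.ant  <- antCount / transCount    # rule.support.both=F

print(c(confidence=confidence, lift=lift,
        support.both=support.both, support.ant=support.ant))
```

Here the two support definitions differ (2/5 versus 3/5) because one of the three transactions containing the antecedent does not contain the consequent.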

The bd.assoc.rules function uses the apriori algorithm code from http://www.borgelt.net/apriori.html. The original LGPL source code and the modified source code used by this function are included in the SHOME/library/bigdata/apriori/ directory. The URL above also contains more information on association rules and the apriori algorithm.

INPUT FORMATS:

bd.assoc.rules processes input data consisting of a series of "transactions", where each transaction contains a set of "items". The argument input.format specifies how transactions are represented in the data object. It must be one of the four strings "item.list", "column.value", "column.flag", or "transaction.id":

Input format "item.list":

In this input format, each input row represents one transaction. The transaction items are all non-NA, non-empty strings in the item columns. In this format, there must be enough columns to handle the maximum number of items in a single transaction.

For example, consider the following four-column, two-row dataset:

i1    i2      i3     i4
milk  cheese  bread
meat  bread

The first transaction contains items "milk", "cheese", and "bread", and the second transaction contains items "meat" and "bread".

Input format "column.flag":

In this input format, each input row represents one transaction. The column names are the item names, and each column's item is included in the transaction if the column's value is "flagged". More specifically, if an item column is numeric, it is flagged if its value is anything other than 0.0 or NA. If the column is a string or factor, the item is flagged if the value is anything other than "0", NA, or an empty string.

For example, the following two-row dataset represents the same transactions as the example above:

bread meat cheese milk cereal chips dip
    1    0      1    1      0     0   0
    1    1      0    0      0     0   0

This format is not suitable for data where there are a large number of possible items, such as a retail market basket analysis with thousands of SKUs, since it would require one column for each SKU.
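
To make the relationship between the "item.list" and "column.flag" layouts concrete, the following sketch converts the two example transactions from one layout to the other. It illustrates the data layout only and is not how bd.assoc.rules reads its input. The object names item.list, all.items, and flags are hypothetical, and the code is written against base data.frame functions as they behave in R.

```r
# The "item.list" example transactions, padded with empty strings.
item.list <- data.frame(i1=c("milk", "meat"),
                        i2=c("cheese", "bread"),
                        i3=c("bread", ""),
                        i4=c("", ""),
                        stringsAsFactors=F)

# Collect the distinct non-NA, non-empty items across all item columns.
all.items <- sort(unique(unlist(lapply(item.list, as.character))))
all.items <- all.items[!is.na(all.items) & all.items != ""]

# One 0/1 column per item: 1 if the item appears anywhere in the row.
flags <- sapply(all.items, function(item)
    as.numeric(apply(item.list == item, 1, any)))
flags <- as.data.frame(flags)

print(flags)
```

The result has one column per distinct item, which shows why this layout grows unwieldy when the number of possible items is large.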

Input format "transaction.id":

In this input format, each transaction is represented by one or more rows. Each row has one or more columns identifying the transaction (specified by the id.columns argument), along with one or more columns containing items as in the "item.list" format. All of the rows with the same transaction ID contain items for a single transaction. This is a very efficient format when individual transactions can have a large number of items, and when there are many possible distinct items.

For example, the following dataset represents the same transactions as the example above:

   id   item
10001  bread
10001 cheese
10001   milk
10002   meat
10002  bread
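
As a sketch of how data in the "item.list" layout could be reshaped into the "transaction.id" layout, the following code stacks the item columns into one row per item. The transaction IDs 10001 and 10002 and the object names item.list, ids, and long are chosen for illustration, and the code is written against base data.frame functions as they behave in R. The bd.assoc.rules call is shown only as a comment, since it requires the bigdata library; its arguments are the ones documented above.

```r
# The "item.list" example transactions, padded with empty strings.
item.list <- data.frame(i1=c("milk", "meat"),
                        i2=c("cheese", "bread"),
                        i3=c("bread", ""),
                        i4=c("", ""),
                        stringsAsFactors=F)
ids <- c(10001, 10002)   # one illustrative ID per input row

# Stack the item columns; the IDs repeat once per item column.
long <- data.frame(id=rep(ids, times=ncol(item.list)),
                   item=unlist(lapply(item.list, as.character)),
                   stringsAsFactors=F)

# Drop the padding, then group the rows of each transaction together.
long <- long[long$item != "", ]
long <- long[order(long$id), ]

print(long)

# With the bigdata library loaded, the call would take this form:
# bd.assoc.rules(long, input.format="transaction.id",
#                item.columns="item", id.columns="id")
```

Because the rows are already sorted by ID here, id.sort=F could be passed to skip the sort step described under the id.sort argument.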

Input format "column.value":

In this input format, each input row represents one transaction. Items are created by combining column names and column values to produce strings of the form "name=val". This is useful for applying association rules to surveys where the results are encoded into a set of factor values.

The following dataset represents three transactions:

Weight Mileage Fuel
medium    high  low
medium    high  low
   low    high  low

The first and second transactions contain the items "Weight=medium", "Mileage=high", and "Fuel=low". The third transaction contains the items "Weight=low", "Mileage=high", and "Fuel=low".
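
The "name=val" item strings described above can be reproduced by hand with paste. This sketch only illustrates the naming scheme and is not the code bd.assoc.rules uses internally. The object names survey and items are hypothetical, and the code is written against base functions as they behave in R.

```r
# The three example transactions, one row per transaction.
survey <- data.frame(Weight=c("medium", "medium", "low"),
                     Mileage=c("high", "high", "high"),
                     Fuel=c("low", "low", "low"),
                     stringsAsFactors=F)

# For each row, combine column names and values into "name=val" items.
items <- lapply(seq(nrow(survey)), function(i)
    paste(names(survey), unlist(survey[i, ]), sep="="))

print(items)
```

Each list element holds the items of one transaction, so the same value in different columns (such as "low" for Weight and for Fuel) yields distinct items.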

EXAMPLES:

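# Five transactions in "item.list" format: each row is one transaction,
# and the strings in columns aa and bb are its items.
# All other arguments keep their defaults (min.support=0.1,
# min.confidence=0.8).
bd.assoc.rules(
    data.frame(aa=c("A","A","B","B","B"),
               bb=c("C","C","C","C","D"),
               stringsAsFactors=F),
    input.format="item.list")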