This function requires the bigdata library section to be loaded.
bd.assoc.rules(data, input.format="item.list", item.columns=NULL, id.columns=character(0), id.sort=T, prescan.items=F, init.items=NULL, min.support=0.1, min.confidence=0.8, min.rule.items=2, max.rule.items=5, output.rule.strings=T, output.rule.items=F, output.rule.sizes=F, output.measures=T, output.counts=F, rule.support.both=T, sort.output=c("lift", "rule"))
data.frame or bdFrame. The input.format argument specifies how transactions are read from this object.
data object. It must be one of the four strings "item.list", "column.value", "column.flag", or "transaction.id". The four formats are described below.
NULL, all columns in data are items.
"transaction.id" input format.
"id.columns" columns before processing the "transaction.id" input format. If the input data is already sorted, this argument can be set to false to avoid sorting the data.
bd.assoc.rules will construct a table of all unique items that appear in the input transactions, even if most of the items will not appear in rules (because they do not appear in enough transactions). If the input data contains many different items, such as thousands of SKUs for retail data, this may run out of memory and fail with an error. Setting prescan.items to true can avoid possible memory problems, at the cost of additional runtime.
bd.assoc.rules will ignore any items not on this list.
You can produce this vector by calling
to get raw item counts, and then selecting items of interest.
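The idea of counting items first and then keeping only the items of interest can be sketched outside S-PLUS. The following Python snippet is an illustration only: the helper name frequent_items and its min_fraction threshold are hypothetical and are not part of the bigdata library, and a frequency cutoff is just one possible way to select items.

```python
from collections import Counter

def frequent_items(transactions, min_fraction=0.1):
    """Return the items that occur in at least min_fraction of the transactions.

    Hypothetical helper: it mimics "get raw item counts, then select items".
    """
    # Count each item once per transaction, even if it is listed twice.
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return sorted(item for item, c in counts.items() if c / n >= min_fraction)

transactions = [
    {"milk", "cheese", "bread"},
    {"meat", "bread"},
    {"bread", "milk"},
    {"bread"},
]
selected = frequent_items(transactions, min_fraction=0.5)
```

A vector like selected is the kind of item list that could then be supplied as init.items.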
rule.support.both argument.
min.rule.items is set to 1, bd.assoc.rules may produce rules with one consequent and zero antecedents. These are meaningful rules, indicating that the consequent item appears in a significant number of transactions, regardless of what items appear with it.
"rule" containing the generated rule, formatted as a string. The example rule string "aa <- bb cc" is a rule with a single consequent item "aa" and two antecedent items "bb" and "cc". The antecedent items are always sorted alphabetically within a rule.
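The rule string layout described above can be split back into its parts. A minimal Python sketch follows; parse_rule is a hypothetical helper, not a bigdata function, and it assumes that item names contain no spaces and do not contain the "<-" separator.

```python
def parse_rule(rule):
    """Split a rule string such as "aa <- bb cc" into (consequent, antecedents)."""
    # Everything left of "<-" is the consequent; the rest are antecedents.
    consequent, _, rest = rule.partition("<-")
    return consequent.strip(), rest.split()

consequent, antecedents = parse_rule("aa <- bb cc")
```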
"con1" containing the generated rule consequent, and columns "ant1", "ant2", etc., containing the rule antecedents. If a given rule has only one antecedent, columns "ant2" and so on are empty strings. Using these columns, it is possible to process the rule items without parsing the rule strings.
"conSize" is the number of consequent items in the rule (currently always 1), "antSize" is the number of antecedent items in the rule, and "ruleSize" is the total number of items in the rule.
"support", "confidence", and "lift" with calculated numeric measures for each generated rule. These values are described below. Note that the definition of rule support is affected by the value of the rule.support.both argument.
"support", as well as more complicated measures for the rules. The raw count columns are: "conCount" is the number of input transactions containing the rule consequent, "antCount" is the number containing the antecedents, "ruleCount" is the number containing both consequent and antecedents, "transCount" is the total number of transactions in the input set, and "itemCount" is the number of items used for creating rules. The "transCount" and "itemCount" values are the same for every rule.
min.support argument is interpreted, as well as the "support" values output when output.measures is true.
c("lift", "rule"), will sort rules with the highest lift values first, and sort rules with the same lift value in alphabetical order. If this includes column names that are not in the result, these column names will be ignored.
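The ordering described above (highest lift first, ties broken alphabetically by rule, unknown column names skipped) can be illustrated with a small Python sketch. The function sort_rules is hypothetical. Sorting "lift" in descending order and every other column in ascending order is an assumption made for this example; the text only documents the behavior for "lift" and "rule".

```python
def sort_rules(rows, sort_output=("lift", "rule")):
    """Sort rule rows (dicts) by the given columns, ignoring unknown names."""
    # Keep only the sort keys that are actual columns of the result.
    keys = [k for k in sort_output if rows and k in rows[0]]
    ordered = list(rows)
    # Stable sorts applied from the last key to the first give a multi-key order.
    for key in reversed(keys):
        ordered.sort(key=lambda r: r[key], reverse=(key == "lift"))
    return ordered

rows = [
    {"rule": "b <- a", "lift": 1.2},
    {"rule": "a <- b", "lift": 1.2},
    {"rule": "c <- a", "lift": 2.0},
]
ordered = sort_rules(rows, ("lift", "rule", "nosuch"))
```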
bdFrame or data.frame of the same type as the input "data" argument, containing one generated rule per row. The arguments output.rule.strings, output.rule.items, etc., specify which columns appear in each output row.
bd.assoc.rules takes a dataset of "transactions", each containing a set of "items", and generates a set of association rules indicating which items appear with (are "associated with") which other items. The significance of each generated rule is indicated by the various measures output when output.measures is true.
The confidence and lift measure values can be calculated from the output.counts values, using the following equations:
    confidence = ruleCount / antCount
    lift = (ruleCount / antCount) / (conCount / transCount)
If the rule.support.both argument is true (the default value), the support measure is calculated using the following equation:
    support = ruleCount / transCount
If the rule.support.both argument is false, the support measure is calculated using the following equation:
    support = antCount / transCount
The output.counts count values also could be used to compute other measures.
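The equations above translate directly into code. The Python sketch below recomputes the three measures from the raw counts; rule_measures is a hypothetical helper name, the counts in the usage lines are made up for illustration, and the function assumes antCount, conCount, and transCount are all greater than zero.

```python
def rule_measures(con_count, ant_count, rule_count, trans_count,
                  rule_support_both=True):
    """Compute support, confidence, and lift from raw transaction counts."""
    confidence = rule_count / ant_count
    lift = confidence / (con_count / trans_count)
    # rule_support_both selects which count defines the support measure.
    support_count = rule_count if rule_support_both else ant_count
    support = support_count / trans_count
    return {"support": support, "confidence": confidence, "lift": lift}

# Made-up counts: 5 transactions, consequent in 4, antecedents in 3, both in 2.
both = rule_measures(con_count=4, ant_count=3, rule_count=2, trans_count=5)
ant_only = rule_measures(4, 3, 2, 5, rule_support_both=False)
```

With these counts, the two calls differ only in the support value (ruleCount / transCount versus antCount / transCount); confidence and lift are unaffected by rule.support.both.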
The bd.assoc.rules function uses the apriori algorithm code from http://www.borgelt.net/apriori.html. The original LGPL source code and the modified source code used by this function are included in the SHOME/library/bigdata/apriori/ directory. The URL above also contains more information on association rules and the apriori algorithm.
bd.assoc.rules processes input data consisting of a series of "transactions", where each transaction contains a set of "items". The argument input.format specifies how transactions are represented in the data object. It must be one of the four strings "item.list", "column.value", "column.flag", or "transaction.id":
Input format "item.list":
In this input format, each input row represents one transaction. The transaction items are all non-NA, non-empty strings in the item columns. In this format, there must be enough columns to handle the maximum number of items in a single transaction.
For example, consider the following four-column, two-row dataset:
    i1    i2      i3     i4
    milk  cheese  bread
    meat  bread
The first transaction contains items "milk", "cheese", and "bread", and the second transaction contains items "meat" and "bread".
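How rows in this format map to transactions can be sketched in Python. This is an illustration only: item_list_transactions is a hypothetical helper, and None stands in for NA.

```python
def item_list_transactions(rows):
    """Turn item.list-style rows into one set of items per transaction."""
    # Every non-NA (None here), non-empty string in a row is an item.
    return [{item for item in row if item not in (None, "")} for row in rows]

rows = [
    ["milk", "cheese", "bread", ""],
    ["meat", "bread", None, None],
]
transactions = item_list_transactions(rows)
```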
Input format "column.flag":
In this input format, each input row represents one transaction. The column names are the item names, and each column's item is included in the transaction if the column's value is "flagged". More specifically, if an item column is numeric, it is flagged if its value is anything other than 0.0 or NA. If the column is a string or factor, the item is flagged if the value is anything other than "0", NA, or an empty string.
For example, the following two-row dataset represents the same transactions as the example above:
    bread  meat  cheese  milk  cereal  chips  dip
        1     0       1     1       0      0    0
        1     1       0     0       0      0    0
This format is not suitable for data where there are a large number of possible items, such as a retail market basket analysis with thousands of SKUs, since it would require one column for each SKU.
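The flagging rule for this format can be written out as a short Python sketch. The helper names is_flagged and column_flag_transactions are hypothetical; None and NaN stand in for NA.

```python
import math

def is_flagged(value):
    """Apply the flag rule: numeric values other than 0 or NA, and strings
    other than "0", NA, or the empty string, count as flagged."""
    if value is None:
        return False
    if isinstance(value, str):
        return value not in ("0", "")
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != 0

def column_flag_transactions(columns, rows):
    """Build one item set per row from the names of the flagged columns."""
    return [{name for name, v in zip(columns, row) if is_flagged(v)}
            for row in rows]

columns = ["bread", "meat", "cheese", "milk", "cereal", "chips", "dip"]
rows = [
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
]
transactions = column_flag_transactions(columns, rows)
```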
Input format "transaction.id":
In this input format, each transaction is represented by one or more rows. Each row has one or more columns identifying the transaction (specified by the id.columns argument), along with one or more columns containing items as in the "item.list" format. All of the rows with the same transaction ID contain items for a single transaction. This is a very efficient format when individual transactions can have a large number of items, and when there are many possible distinct items.
For example, the following dataset represents the same transactions as the example above:
    id     item
    10001  bread
    10001  cheese
    10001  milk
    10002  meat
    10002  bread
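Grouping rows by transaction ID can be sketched in Python as follows. The helper id_transactions is hypothetical. It groups adjacent rows with the same ID, so it sorts by ID first unless told otherwise; treating this as the reason for the id.sort argument is an assumption, not something the text states.

```python
from itertools import groupby

def id_transactions(rows, id_sort=True):
    """Collect (id, item) rows into a dict mapping each id to its item set."""
    if id_sort:
        # Bring rows with the same transaction ID next to each other.
        rows = sorted(rows, key=lambda r: r[0])
    # groupby only merges adjacent rows, so unsorted input needs id_sort=True.
    return {tid: {item for _, item in group}
            for tid, group in groupby(rows, key=lambda r: r[0])}

rows = [
    (10001, "bread"),
    (10001, "cheese"),
    (10001, "milk"),
    (10002, "meat"),
    (10002, "bread"),
]
transactions = id_transactions(rows)
```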
Input format "column.value":
In this input format, each input row represents one transaction. Items are created by combining column names and column values to produce strings of the form "name=val". This is useful for applying association rules to surveys where the results are encoded into a set of factor values.
The following dataset represents three transactions:
    Weight  Mileage  Fuel
    medium  high     low
    medium  high     low
    low     high     low
The first and second transactions contain the items "Weight=medium", "Mileage=high", and "Fuel=low". The third transaction contains the items "Weight=low", "Mileage=high", and "Fuel=low".
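The construction of "name=val" items can be sketched in Python. The helper column_value_transactions is hypothetical, and skipping missing values (None here) is an assumption; the text does not say how NA values are treated in this format.

```python
def column_value_transactions(columns, rows):
    """Build "name=val" items from each row, one item set per transaction."""
    # Assumption: missing values (None) produce no item.
    return [{f"{name}={val}" for name, val in zip(columns, row)
             if val is not None}
            for row in rows]

columns = ["Weight", "Mileage", "Fuel"]
rows = [
    ["medium", "high", "low"],
    ["medium", "high", "low"],
    ["low", "high", "low"],
]
transactions = column_value_transactions(columns, rows)
```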
bd.assoc.rules(
    data.frame(aa=c("A","A","B","B","B"),
               bb=c("C","C","C","C","D"),
               stringsAsFactors=F),
    input.format="item.list")