Join Multiple Inputs

DESCRIPTION:

Creates a composite data set from two or more inputs. For each input, you can specify a set of key columns that determines which rows are combined in the output. You can also specify, for each input, whether its unmatched rows are output.

This function requires the bigdata library section to be loaded.

USAGE:

bd.join(data, key.columns=NULL, all.rows=F, suffixes,
     only.unmatched=F, natural=F, sort=T)

REQUIRED ARGUMENTS:

data
input data set or list of data sets, bdFrame(s) or data.frame(s).

OPTIONAL ARGUMENTS:

key.columns
key columns from data sets. If this is NULL, missing, or an empty list, the join is done by row number. This argument should be a list with as many elements as there are data sets specified in the data argument. Each element of the list must be a character vector specifying the name(s) of the column(s) of the corresponding data set that are to be used as key columns. Matching is done by position of the column name; i.e., the first column specified for the first data set is matched with the first column specified for the second data set, whether or not the names are the same.
all.rows
determines whether unmatched rows are output for each data set. If TRUE, include rows from each input whose keys do not match the other inputs. Can also be a logical vector specifying which inputs should have their unmatched rows included.
suffixes
suffixes added to columns with the same name in input data sets.
only.unmatched
if TRUE, regardless of other settings (except natural), ONLY unmatched rows are included in output.
natural
if TRUE, regardless of other settings, a natural join by row is executed. Key columns are ignored if this is TRUE.
sort
if FALSE, do not sort data by key columns first.

If the inputs are not already sorted, setting sort to FALSE does not necessarily produce correct results. Set sort to FALSE only if you know that the inputs are sorted by the key columns; doing so improves speed.

VALUE:

an object of class "bdFrame" or "data.frame" (the same class as the contents of data), containing the columns from all the inputs.

DETAILS:

This function accepts an arbitrary number of inputs and creates output containing all the input columns. In the simplest case, omitting the key.columns argument, a row-by-row join is executed. In this scenario, each row of each input is transferred to the output. To align the data from each input, one or more key columns may be specified for each input.

Where input column names conflict, a suffix is added. The suffix can be specified with the suffixes argument. If omitted, the suffix is the input number. If adding the suffix does not create a unique column name in the output, a series of "." characters is inserted between the column name and the suffix to ensure uniqueness.

In both the key driven and row joins, unmatched rows can exist. In the key driven join, a row is unmatched when its particular key combination is not present in all of the inputs. In the row join, a row qualifies as unmatched if any of the inputs have fewer rows than the current row number. Whether these rows get output is controlled by the all.rows argument.
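
As a small sketch of the row-join case described above: the two data frames below are hypothetical and have different lengths, so rows 4 and 5 of the second input have no partner in the first. Going by the description, those rows should be dropped by default and kept (with missing values in the first input's column) when all.rows=T. The output is not shown because it has not been verified.

```s
# hypothetical inputs of unequal length
short <- data.frame(p=1:3)
long <- data.frame(q=101:105)
# row join with the default all.rows=F: unmatched rows are not requested
bd.join(list(short, long))
# row join requesting unmatched rows from every input
bd.join(list(short, long), all.rows=T)
```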

In a key driven join, if there are several matched rows sharing the same key, the cross-product of the rows is output. This can lead to an expansion of rows in the output.
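
The expansion described above can be tried with a pair of small, hypothetical data frames in which the key value 1 occurs twice in each input. If the cross-product rule applies as stated, the two k==1 rows of the first input should each pair with the two k==1 rows of the second, so the output should have more rows than either input.

```s
# hypothetical inputs: key value 1 appears twice in each
lhs <- data.frame(k=c(1,1,2), a=c(10,20,30))
rhs <- data.frame(k=c(1,1,2), b=c(100,200,300))
# key driven join on column "k" of both inputs;
# rows sharing k==1 are combined as a cross-product
bd.join(list(lhs, rhs), key.columns=list("k", "k"))
```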

In a key driven join, if the inputs are already sorted by the key columns, setting sort to FALSE will speed up the join.

If the inputs are not sorted, you can use bd.sort to sort them explicitly and pass them to bd.join, and then set sort to FALSE. For example, bd.join(list(A,B,bd.sort(C,C.key.cols),bd.sort(D,D.key.cols)), sort=F)
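
A more concrete version of the call above might look like the following sketch. The data frames and the column name "id" are hypothetical, and passing the key column name as the second argument to bd.sort simply mirrors the usage shown above; check the bd.sort help file for its exact arguments.

```s
# hypothetical unsorted inputs sharing the key column "id"
A <- data.frame(id=c(3,1,2), a=c(30,10,20))
B <- data.frame(id=c(2,3,1), b=c(200,300,100))
# sort each input by its key column, then tell bd.join to skip its own sort
bd.join(list(bd.sort(A, "id"), bd.sort(B, "id")),
     key.columns=list("id", "id"), sort=F)
```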

EXAMPLES:

## Join two data sets by row.
## The output has columns with suffixes,
## since there are common column names:
##   Weight1, Disp.1, Weight2, Disp.2, Mileage
bd.join(list(fuel.frame[,1:2], fuel.frame[,1:3]))
# Join two data sets by key columns
# Here we add a "Price" column,
# with values for Van and Small rows.
# Since all.rows=T, we also keep rows
# with other Type values, adding Price==NA
bd.join(list(fuel.frame, data.frame(KnownTypes=c("Van", "Small"),
     Price=1000:1001)), key.columns=list("Type", "KnownTypes"),
     all.rows=T)
# Join two data sets by key columns, where both data sets are the same
df1 <- data.frame(a=rep(c(T,F), 5), b=1:10, c=timeSpan(julian=1:10,
     ms=3*(1:10)))
bdf1 <- as.bdFrame(df1)
bd.join(list(bdf1, bdf1), key.columns=list(c("a", "b"),c("a", "b")))
# Recreate fuel.frame from pieces by executing a join by row.
bd.join(list(fuel.frame[, 1:3], fuel.frame[, 4, drop=F],
     fuel.frame[, 5, drop=F]))
# View the results of multiple shuffles in one data set
bd.join(list(fuel.frame, bd.shuffle(fuel.frame), bd.shuffle(fuel.frame)))
# Recreate fuel.frame from pieces by joining based on the first column
bd.join(list(fuel.frame[,1:3], fuel.frame[,c(1,4)], fuel.frame[,c(1,5)]), 1)
# Use all.rows as a logical vector specifying which
# inputs should have their unmatched rows included.
xx <- bdFrame(x=1:3,y=101:103)
yy <- bdFrame(x=3:6,z=1003:1006)
# do not get any unmatched rows
bd.join(list(xx,yy), key.columns="x")
# get unmatched rows from all inputs
bd.join(list(xx,yy), key.columns="x", all.rows=T)
# get unmatched rows from first input
bd.join(list(xx,yy), key.columns="x", all.rows=c(T,F))
# get unmatched rows from second input
bd.join(list(xx,yy), key.columns="x", all.rows=c(F,T))