This function requires the bigdata library section to be loaded.
bd.join(data, key.columns=NULL, all.rows=F, suffixes, only.unmatched=F, natural=F, sort=T)
data: a list containing the input bdFrame(s) or data.frame(s).
key.columns: if NULL, missing, or an empty list, the join is done by row number.
Otherwise, this argument should be a list with as many elements as there are
data sets specified in the data argument. Each element of the list must be a
character vector specifying the name(s) of the column(s) of the corresponding
data set that are to be used as key columns. Matching is done by position of
the column name; i.e., the first column specified for the first data set is
matched with the first column specified for the second data set, whether or
not the names are the same.
all.rows: if TRUE, include rows from an input whose keys do not match the
other inputs. all.rows can also be a logical vector specifying which inputs
should have their unmatched rows included.
only.unmatched: if TRUE, regardless of other settings (except natural),
ONLY unmatched rows are included in the output.
natural: if TRUE, regardless of other settings, a natural join by row is
executed. Columns are ignored if this is TRUE.
sort: if FALSE, do not sort the data by key columns first. Setting sort to
FALSE will not necessarily produce correct results. Set sort to FALSE ONLY
if you know that the input columns are already sorted, to improve the speed.
Returns an object of class "bdFrame" or "data.frame" (the same class as the
contents of data) containing the columns from all the inputs.
This function accepts an arbitrary number of inputs and creates
output containing all the input columns. In the simplest case,
when the key.columns argument is omitted, a row-by-row join is executed.
In this scenario, each row of each input is transferred to the
output. To align the data from each input, one or more key columns
may be specified for each input.
When input column names conflict, a suffix is added to each conflicting
name. The suffixes can be specified with the suffixes argument. If it is
omitted, the suffix is the input number. If adding the suffix does not
create a unique column name in the output, a series of "." characters is
inserted between the column name and the suffix to ensure uniqueness.
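The naming rule above can be sketched as a small helper. This is only an illustration of the rule as stated in the text, not the code bd.join uses internally; the function name makeJoinName and its arguments are hypothetical.

```r
# Hypothetical helper illustrating the suffix rule described above.
# name:   the conflicting column name
# suffix: the suffix for this input (by default, the input number)
# taken:  character vector of column names already present in the output
makeJoinName <- function(name, suffix, taken) {
  dots <- ""
  candidate <- paste(name, dots, suffix, sep="")
  # Insert one more "." between name and suffix until the name is unused.
  while (!is.na(match(candidate, taken))) {
    dots <- paste(dots, ".", sep="")
    candidate <- paste(name, dots, suffix, sep="")
  }
  candidate
}

makeJoinName("Weight", "1", character(0))              # "Weight1"
makeJoinName("Weight", "1", c("Weight1"))              # "Weight.1"
makeJoinName("Weight", "1", c("Weight1", "Weight.1"))  # "Weight..1"
```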
In both key-driven and row joins, unmatched rows can exist. In a
key-driven join, a row is unmatched when its particular key combination
is not present in all of the inputs. In a row join, a row qualifies as
unmatched if any of the inputs has fewer rows than the current row number.
Whether these rows are output is controlled by the all.rows argument.
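To make the notion of unmatched rows concrete for two inputs, the sketch below uses the base merge function as a stand-in. It is not bd.join, and its all, all.x, and all.y arguments are only assumed to play roughly the role that all.rows=T, all.rows=c(T,F), and all.rows=c(F,T) play here.

```r
# Stand-in sketch using base merge(), not bd.join() itself.
xx <- data.frame(x=1:3, y=101:103)
yy <- data.frame(x=3:6, z=1003:1006)

# Only the key present in both inputs (x == 3) survives.
matched.only <- merge(xx, yy, by="x")

# Unmatched rows from both inputs are kept; missing values become NA.
all.inputs <- merge(xx, yy, by="x", all=TRUE)

# Unmatched rows from the first input only (x = 1, 2, 3).
first.only <- merge(xx, yy, by="x", all.x=TRUE)

# Unmatched rows from the second input only (x = 3, 4, 5, 6).
second.only <- merge(xx, yy, by="x", all.y=TRUE)
```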
In a key-driven join, if several matched rows share the same key, the cross-product of those rows is output. This can greatly increase the number of rows in the output.
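The row expansion is easy to underestimate. The sketch below again uses base merge as a stand-in rather than bd.join, with made-up data, to show how duplicated keys multiply.

```r
# Stand-in sketch using base merge(), not bd.join() itself.
left  <- data.frame(k=c(1, 1, 2), a=c("a1", "a2", "a3"))
right <- data.frame(k=c(1, 1, 1, 2), b=c("b1", "b2", "b3", "b4"))

joined <- merge(left, right, by="k")

# Key 1 appears 2 times on the left and 3 times on the right: 2 * 3 = 6 rows.
# Key 2 appears once on each side: 1 * 1 = 1 row.
nrow(joined)  # 7
```

If the number of output rows matters, counting the occurrences of each key in every input before joining gives the expected size as the product of the counts, summed over the matched keys.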
In a key-driven join, if the inputs are already sorted by the key columns,
setting sort to FALSE will speed up the join. If the inputs are not sorted,
you can use bd.sort to sort them explicitly, pass the sorted inputs to
bd.join, and then set sort to FALSE. For example,
bd.join(list(A,B,bd.sort(C,C.key.cols),bd.sort(D,D.key.cols)), sort=F)
# Join two data sets by row.
# The output has columns with suffixes,
# since there are common column names:
# Weight1, Disp.1, Weight2, Disp.2, Mileage
bd.join(list(fuel.frame[,1:2], fuel.frame[,1:3]))
# Join two data sets by key columns.
# Here we add a "Price" column,
# with values for Van and Small rows.
# Since all.rows=T, we also keep rows
# with other Type values, adding a Price==NA
bd.join(list(fuel.frame,
    data.frame(KnownTypes=c("Van", "Small"), Price=1000:1001)),
    key.columns=list("Type", "KnownTypes"), all.rows=T)
# Join two data sets by key columns, where the data sets are the same
df1 <- data.frame(a=rep(c(T,F), 5), b=1:10, c=timeSpan(julian=1:10, ms=3*(1:10)))
bdf1 <- as.bdFrame(df1)
bd.join(list(bdf1, bdf1), key.columns=list(c("a", "b"),c("a", "b")))
# Recreate fuel.frame from pieces by executing a join by row.
bd.join(list(fuel.frame[, 1:3], fuel.frame[, 4, drop=F], fuel.frame[, 5, drop=F]))
# View the results of multiple shuffles in one data set
bd.join(list(fuel.frame, bd.shuffle(fuel.frame), bd.shuffle(fuel.frame)))
# Recreate fuel.frame from pieces by joining based on the first column
bd.join(list(fuel.frame[,1:3], fuel.frame[,c(1,4)], fuel.frame[,c(1,5)]), 1)
# Use all.rows as a logical vector specifying which
# inputs should have their unmatched rows included.
xx <- bdFrame(x=1:3,y=101:103)
yy <- bdFrame(x=3:6,z=1003:1006)
# do not get any unmatched rows
bd.join(list(xx,yy), key.columns="x")
# get unmatched rows from all inputs
bd.join(list(xx,yy), key.columns="x", all.rows=T)
# get unmatched rows from first input
bd.join(list(xx,yy), key.columns="x", all.rows=c(T,F))
# get unmatched rows from second input
bd.join(list(xx,yy), key.columns="x", all.rows=c(F,T))