Takes two data frames or bdFrames and the names or numbers of a set
of columns in each of them to match (the by columns). It returns a
new data frame or bdFrame that has a row for each pair of rows in x
and y whose by columns have the same values. This row contains all
the columns in both x and y, except that only one copy of the by
columns appears. (In database language this is called the join of two
relations.) The matching can be one to one, many to one, or many to
many. Note that x and y need not have similar dimensions, but the
columns to match by should contain similar data.
In some cases, you might want to include all the rows in one or both
data sets or bdFrames in the output, even if there is not a matching
row in the other. The all.x and all.y arguments let you do this. In
the rows without matches, merge puts NAs in the columns with no
matching data.
This is a generic function, but currently the only substantive method
for it works on data frames and bdFrames. The default method converts
x and y to data frames or bdFrames and calls the method for data
frames or bdFrames.
USAGE:
merge(x, y,
      by = intersect(names(x), names(y)),
      by.x = by, by.y = by,
      all = F, all.x = all, all.y = all,
      suffixes = c(".x", ".y"))
REQUIRED ARGUMENTS:
x
a data frame or bdFrame, or something to be converted into a data
frame or bdFrame.
y
a data frame or bdFrame, or something to be converted into a data
frame or bdFrame.
OPTIONAL ARGUMENTS:
by
a vector of columns to match by. This can be a vector of column
names, column numbers, or a logical vector with a T or F for each
column, telling which columns to match by.
The special name "row.names" means to match by the row names of the
data frames or bdFrames. In that case a new column called Row.names
is formed in the output.
If you supply by, it is used for both x and y. If the by columns have
different names or locations in x and y, use by.x and by.y instead.
The default value is the vector of column names that are common to
x and y.
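by.x
the columns of x to match by, specified in the same way as by.
The default is by. See by.
by.y
the columns of y to match by, specified in the same way as by.
The default is by. See by.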
all
a logical value; all=T is shorthand for all.x=T and all.y=T.
The default is F.
all.x
a logical value; if TRUE, extra rows are added to the output, one for
each row in x that has no matching row in y. These rows have NAs in
the columns that are usually filled with values from y.
The default is FALSE, so only rows with data from both x and y are
included in the output.
all.y
a logical value. Analogous to all.x: if TRUE, the output also
contains a row for each row in y that has no matching row in x.
suffixes
a character vector containing two distinct strings. If x and y have
some column names in common, and those columns are not used for
matching, the output would have two columns with the same name, which
is not allowed for data frames or bdFrames. merge pastes the suffixes
onto these repeated column names to make them unique: the first
string for columns from x and the second for columns from y.
The default is c(".x", ".y").
VALUE:
a data frame or bdFrame with the by columns first, then the remaining
columns of x and y.
EXAMPLES:
# Create 2 data frames, one with information on authors
# and one concerning books. Use merge to relate the
# names of the books written to attributes of the authors.
# Note that some authors have no books listed and some have
# several books.
authors <- data.frame(
    FirstName=c("Lorne", "Loren", "Robin",
        "Robin", "Billy"),
    LastName=c("Green", "Jaye", "Green",
        "Howe", "Jaye"),
    Age=c(82, 40, 45, 2, 40),
    Income=c(1200000, 40000, 25000, 0, 27500),
    Home=c("California", "Washington", "Washington",
        "Alberta", "Washington"))
books <- data.frame(
    AuthorFirstName=c("Lorne", "Loren", "Loren",
        "Loren", "Robin", "Rich"),
    AuthorLastName=c("Green", "Jaye", "Jaye", "Jaye",
        "Green", "Calaway"),
    Book=c("Bonanza", "Midwifery", "Gardening",
        "Perennials", "Who_dun_it?", "Splus"))
# Look at all cases in which the author is in both the
# authors and books datasets. Match author by both first
# and last names -- these have different labels in the 2
# datasets but are in the first 2 columns of both.
merge(authors, books, by=1:2)
# Produces the following output:
#   FirstName LastName Age  Income       Home
# 1     Lorne    Green  82 1200000 California
# 2     Loren     Jaye  40   40000 Washington
# 3     Loren     Jaye  40   40000 Washington
# 4     Loren     Jaye  40   40000 Washington
# 5     Robin    Green  45   25000 Washington
#          Book
# 1     Bonanza
# 2   Midwifery
# 3   Gardening
# 4  Perennials
# 5 Who_dun_it?
# Next, make sure all authors in the authors dataset are
# listed, even if there is no book listed for them. Using
# by.x and by.y may be a more reliable way to handle
# cases in which the datasets have different names for
# the columns to match by.
merge(authors, books, by.x=c("FirstName", "LastName"),
    by.y=c("AuthorFirstName", "AuthorLastName"),
    all.x=T)
# Produces the following:
#   FirstName LastName Age  Income       Home
# 1     Billy     Jaye  40   27500 Washington
# 2     Lorne    Green  82 1200000 California
# 3     Loren     Jaye  40   40000 Washington
# 4     Loren     Jaye  40   40000 Washington
# 5     Loren     Jaye  40   40000 Washington
# 6     Robin    Green  45   25000 Washington
# 7     Robin     Howe   2       0    Alberta
#          Book
# 1          NA
# 2     Bonanza
# 3   Midwifery
# 4   Gardening
# 5  Perennials
# 6 Who_dun_it?
# 7          NA
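# A further sketch, not one of the original examples, showing
# many to many matching together with all=T and the default
# suffixes. The data frames left and right and their columns
# are hypothetical, made up for illustration.
left <- data.frame(key=c("a", "a", "b"), value=1:3)
right <- data.frame(key=c("a", "a", "c"), value=c(10, 20, 30))
both <- merge(left, right, by="key", all=T)
# Each of the 2 "a" rows in left pairs with each of the 2 "a"
# rows in right, giving 4 rows. "b" and "c" have no match, so
# all=T should add one row for each, with NA in the value
# column that has no data. Because "value" appears in both
# inputs and is not a by column, the two copies should be
# labeled value.x and value.y.
dim(both)
names(both)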
# Use the state.x77 dataset to relate the income of the author
# to the per capita income of his or her home state (we have no
# information on Alberta, a Canadian province). Note the use
# of "row.names" where the "column" to match on is not a variable
# in the dataset but is the names of the rows. Both datasets have
# a column called "Income" which is not a key variable, so supply
# the suffixes argument to distinguish between them in the
# output (without suffixes they would be labeled "Income.x"
# and "Income.y").
state.data <- data.frame(state.x77)
merge(authors, state.data[, "Income", drop=F],
    by.x="Home", by.y="row.names", all.x=T,
    suffixes=c("Author", "State"))
# Produces the following:
#   FirstName LastName Age IncomeAuthor       Home
# 1     Robin     Howe   2            0    Alberta
# 2     Lorne    Green  82      1200000 California
# 3     Loren     Jaye  40        40000 Washington
# 4     Robin    Green  45        25000 Washington
# 5     Billy     Jaye  40        27500 Washington
#   IncomeState
# 1          NA
# 2        5114
# 3        4864
# 4        4864
# 5        4864
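# A minimal sketch, not one of the original examples, of
# matching on the row names of both inputs with by="row.names".
# The data frames a and b and their columns are hypothetical,
# made up for illustration.
a <- data.frame(u=1:3, row.names=c("r1", "r2", "r3"))
b <- data.frame(v=4:6, row.names=c("r2", "r3", "r4"))
m <- merge(a, b, by="row.names")
# Only r2 and r3 are row names in both a and b, so m should
# have 2 rows, with the matched row names held in a new
# column called Row.names.
names(m)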