Merge Two Datasets and Match Columns

DESCRIPTION:

Takes two data frames or bdFrames and the names or numbers of a set of columns in each of them to match (the by columns). It returns a new data frame or bdFrame that has a row for each pair of rows in x and y, and whose by columns have the same values. This row contains all the columns in both x and y, except only one copy of the by columns appear. (In database language this is called the join of two relations.) You might have one to one, many to one, or many to many matching. Note that x and y need not have similar dimensions, but the columns to match by should contain similar data.

In some cases, you might want to include all the rows in one or both data sets or bdFrames in the output, even if there is not a matching row in the other. The all.x and all.y arguments let you do this. In the rows without matches, merge puts NAs in the columns with no matching data.

This is a generic function, but currently the only substantive method for it works on data frames and bdFrames. The default method converts x and y to data frames or bdFrames and calls the method for data frames or bdFrames.

USAGE:

merge(x, y, 
      by = intersect(names(x), names(y)),  
      by.x = by, by.y = by, 
      all = F, all.x = all, all.y = all, 
      suffixes = c(".x", ".y")) 

REQUIRED ARGUMENTS:

x
a data frame or bdFrame, or something to be converted into a data frame or bdFrame.
y
a data frame or bdFrame, or something to be converted into a data frame or bdFrame.

OPTIONAL ARGUMENTS:

by
a vector of columns to match by. This can be a vector of column names, column numbers, or a logical vector with a T or F for each column, telling which columns to match by. The special name row.names means to match by the row names of the data frames or bdFrames. In that case a new column will be formed called Row.names. If you supply by it will be used for both x and y. If the by columns have different names or locations in x and y then use by.x and by.y. The default value is the vector of column names that are common to x and y.
by.x
See by.
by.y
See by.
all
Shorthand for all.x=T and all.y=T.
all.x
a logical value; if TRUE, extra rows are added to the output, one for each row in x that has no matching row in y. These rows have NA's in the columns that are usually filled with values from y. The default is FALSE, so only rows with data from both x and y are included in the output.
all.y
a logical value. Analogous to all.x, controlling when the output contains rows for y rows with no matching x row.
suffixes
A character vector containing two distinct strings. If x and y have some column names in common, and those columns are not used for matching, the output has two columns with the same name, which is not allowed for data frames or bdFrames. merge pastes suffixes on these repeated column names to make them unique. The default is c(".x",".y").

VALUE:

a data frame or bdFrame with the by columns first then the remaining columns of x and y.

SEE ALSO:

, , , , .

EXAMPLES:

# Create 2 data frames, one with information on authors 
# and one concerning books.  Use merge to relate the 
# names of the books written to attributes of the authors. 
# Note that some authors have no books listed and some have 
# several books. 
authors <- data.frame( 
    FirstName=c("Lorne", "Loren", "Robin",  
                  "Robin", "Billy"), 
    LastName=c("Green", "Jaye", "Green",  
                 "Howe", "Jaye"), 
    Age=c(82, 40, 45, 2, 40),  
    Income=c(1200000, 40000, 25000, 0, 27500),  
    Home=c("California", "Washington", "Washington",  
        "Alberta", "Washington")) 
  
books <- data.frame( 
    AuthorFirstName=c("Lorne", "Loren", "Loren", 
        "Loren", "Robin", "Rich"), 
    AuthorLastName=c("Green", "Jaye", "Jaye", "Jaye", 
        "Green", "Calaway"), 
    Book=c("Bonanza", "Midwifery", "Gardening", 
        "Perennials", "Who_dun_it?", "Splus")) 
# Look at all cases in which the author is in both the  
# authors and books datasets.  Match author by both first  
# and last names -- these have different labels in the 2  
# datasets but are in the first 2 columns of both. 
merge(authors, books, by=1:2) 
# Produces the following output: 
#   FirstName LastName Age  Income       Home 
# 1     Lorne    Green  82 1200000 California 
# 2     Loren     Jaye  40   40000 Washington 
# 3     Loren     Jaye  40   40000 Washington 
# 4     Loren     Jaye  40   40000 Washington 
# 5     Robin    Green  45   25000 Washington 
#          Book 
# 1     Bonanza 
# 2   Midwifery 
# 3   Gardening 
# 4  Perennials 
# 5 Who_dun_it? 

# Next, make sure all authors in the authors dataset are  
# listed, even if there is no book listed for them.  Using  
# by.x and by.y may be a more reliable way to handle 
# cases in which the datasets have different  names for
# the columns to match by. 
merge(authors, books, by.x=c("FirstName", "LastName"), 
      by.y=c("AuthorFirstName", "AuthorLastName"),  
      all.x=T) 
# Produces the following: 
#   FirstName LastName Age  Income       Home 
# 1     Billy     Jaye  40   27500 Washington 
# 2     Lorne    Green  82 1200000 California 
# 3     Loren     Jaye  40   40000 Washington 
# 4     Loren     Jaye  40   40000 Washington 
# 5     Loren     Jaye  40   40000 Washington 
# 6     Robin    Green  45   25000 Washington 
# 7     Robin     Howe   2       0    Alberta 
#          Book 
# 1          NA 
# 2     Bonanza 
# 3   Midwifery 
# 4   Gardening 
# 5  Perennials 
# 6 Who_dun_it? 
# 7          NA 

# Use the state.x77 dataset to relate the income of the author  
# to the median income of his or her home state (we have no  
# information on Alberta, a Canadian province).  Note the use  
# of "row.names" where the "column" to match on is not a variable  
# in the dataset but is the names of the rows.  Both datasets have  
# a column called "Income" which is not a key variable, so supply 
# the suffixes argument to distinguish between them in the 
# output (without suffixes they would be labeled "Income.x" 
# and "Income.y"). 
state.data <- data.frame(state.x77) 
merge(authors, state.data[, "Income", drop=F], 
      by.x="Home", by.y="row.names", all.x=T, 
      suffixes=c("Author", "State")) 
# Produces the following: 
#   FirstName LastName Age IncomeAuthor       Home 
# 1     Robin     Howe   2            0    Alberta 
# 2     Lorne    Green  82      1200000 California 
# 3     Loren     Jaye  40        40000 Washington 
# 4     Robin    Green  45        25000 Washington 
# 5     Billy     Jaye  40        27500 Washington 
#   IncomeState 
# 1          NA 
# 2        5114 
# 3        4864 
# 4        4864 
# 5        4864