Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database-style, i.e., SQL-style, joins? That is, how do I get:
- An inner join of df1 and df2: return only the rows in which the left table has matching keys in the right table.
- An outer join of df1 and df2: return all rows from both tables, joining records from the left which have matching keys in the right table.
- A left outer join (or simply left join) of df1 and df2: return all rows from the left table, and any rows with matching keys from the right table.
- A right outer join of df1 and df2: return all rows from the right table, and any rows with matching keys from the left table.
Extra credit:
How can I do a SQL-style SELECT statement?
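A minimal sketch of one option for the extra credit, the sqldf package (which also appears in the benchmarks further down), which runs SQL SELECT statements directly against data frames:

library(sqldf)

# an inner join written as a SQL SELECT over the two data frames
sqldf("SELECT df1.CustomerId, Product, State
       FROM df1
       JOIN df2 ON df1.CustomerId = df2.CustomerId")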
For the case of a left join with a 0..*:0..1 cardinality, or a right join with a 0..1:0..* cardinality, it is possible to assign in-place the unilateral columns from the joiner (the 0..1 table) directly onto the joinee (the 0..* table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.

If the key is a single column, then we can use a single call to match() to do the matching. This is the case I'll cover in this answer.

Here's an example based on the OP, except I've added an extra row to df2 with an id of 7 to test the case of a non-matching key in the joiner. This is effectively df1 left join df2:
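A minimal sketch of the match()-based in-place assignment just described; the Texas value for the extra id-7 row is my own illustrative choice:

df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2L, 4L, 6L, 7L),
                  State = c("Alabama", "Alabama", "Ohio", "Texas"))

# match the joinee's key (column 1 of df1) into the joiner's key (column 1 of df2),
# then assign the joiner's non-key columns onto the joinee in place
df1[names(df2)[-1L]] <- df2[match(df1[, 1L], df2[, 1L]), -1L]
df1
#   CustomerId Product   State
# 1          1 Toaster    <NA>
# 2          2 Toaster Alabama
# 3          3 Toaster    <NA>
# 4          4   Radio Alabama
# 5          5   Radio    <NA>
# 6          6   Radio    Ohio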
In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.
I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.
On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. In that case, though, you could simply make a copy and perform the in-place assignment(s) on the copy.
As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:
- match(interaction(df1$a, df1$b), interaction(df2$a, df2$b)), or the same idea with paste() (sketched below);
- outer(df1$a, df2$a, `==`) & outer(df1$b, df2$b, `==`);
- merge() and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.

For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.
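A sketch of the first option above, for a hypothetical two-column key (a, b); it yields the same kind of index vector that match() gives for a single-column key:

# hypothetical frames with a two-column key (a, b)
d1 <- data.frame(a = c(1, 1, 2), b = c("x", "y", "x"))
d2 <- data.frame(a = c(1, 2),    b = c("y", "x"))

# build a composite key per row, then match as if it were a single column
ix <- match(interaction(d1$a, d1$b), interaction(d2$a, d2$b))
ix
# [1] NA  1  2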
Benchmarking
I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.
Testing code:
Here's a benchmark of the example based on the OP that I demonstrated earlier:
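A minimal sketch of this kind of benchmark, assuming the microbenchmark package and comparing just the in-place approach against base merge() (the copy inside the in-place branch keeps each run starting from an unmodified joinee):

library(microbenchmark)

df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2L, 4L, 6L, 7L),
                  State = c("Alabama", "Alabama", "Ohio", "Texas"))

microbenchmark(
  in.place = {
    d <- df1  # copy so each run starts from the unmodified joinee
    d[names(df2)[-1L]] <- df2[match(d[, 1L], df2[, 1L]), -1L]
  },
  merge = merge(df1, df2, by = "CustomerId", all.x = TRUE),
  times = 100L
)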
Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses 0..1:0..1 cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.

I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.
I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.
Here's a second large-scale benchmark that's more heavy-duty with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is, 0..*:0..*). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with the sqldf.indexed solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)

The resulting plots, using the same plotting code given above:
dplyr since 0.4 implemented all those joins, including full_join (the outer join). But it is worth noting that for the first few releases prior to 0.4 it did not offer full_join, and as a result there was a lot of really bad hacky workaround user code floating around for quite a while afterwards (you can still find such code in SO, Kaggle answers, and github from that period; hence this answer still serves a useful purpose).

Join-related release highlights:
v0.5 (6/2016)
v0.4.0 (1/2015)
v0.3 (10/2014)
v0.2 (5/2014)
v0.1.3 (4/2014)
Workarounds per hadley's comments in that issue:
For an inner join on all columns, you could also use fintersect from the data.table package or intersect from the dplyr package as an alternative to merge without specifying the by columns. This will give the rows that are equal between the two data frames:

Example data:
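A minimal sketch with stand-in frames (the values are my own, not the answer's; both frames must have identical column sets):

library(data.table)
library(dplyr)

# stand-in example data: identical columns, partially overlapping rows
d1 <- data.frame(CustomerId = c(2, 4, 6), State = c("Alabama", "Alabama", "Ohio"))
d2 <- data.frame(CustomerId = c(4, 6, 7), State = c("Alabama", "Ohio", "Texas"))

fintersect(as.data.table(d1), as.data.table(d2))  # data.table: inputs must be data.tables
intersect(d1, d2)                                 # dplyr: works on plain data frames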
By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you are matching on only the fields you desire. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly, and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods: one joins inside [.data.table, passing the second data.table as the first argument to subset; the other uses the merge function, which dispatches to the fast data.table method.
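A minimal sketch of the two approaches on the OP's data (the on= form is standard data.table syntax; column order in the [.data.table results differs from merge):

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Method 1: join inside [.data.table, with the other table as the i argument
dt2[dt1, on = "CustomerId"]                # all rows of dt1 (df1 left join df2)
dt1[dt2, on = "CustomerId"]                # all rows of dt2 (df1 right join df2)
dt1[dt2, on = "CustomerId", nomatch = 0L]  # inner join

# Method 2: merge(), which dispatches to the fast data.table method
merge(dt1, dt2, by = "CustomerId")               # inner
merge(dt1, dt2, by = "CustomerId", all = TRUE)   # full outer
merge(dt1, dt2, by = "CustomerId", all.x = TRUE) # left outer
merge(dt1, dt2, by = "CustomerId", all.y = TRUE) # right outer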
The benchmark below tests base R, sqldf, dplyr and data.table.
The benchmark tests unkeyed/unindexed datasets. It is performed on datasets of 50M-1 rows, with 50M-2 common values in the join column, so each scenario (inner, left, right, full) can be tested while the join is still not trivial to perform. This is the type of join that stresses join algorithms well. Timings are as of sqldf 0.4.11, dplyr 0.7.8, and data.table 1.12.0.
Be aware there are other types of joins you can perform using data.table:
- update on join - if you want to look up values from another table into your main table (sketched below)
- aggregate on join - if you want to aggregate on the key you are joining on, you do not have to materialize all the join results
- overlapping join - if you want to merge by ranges
- rolling join - if you want your merge to be able to match values from preceding/following rows by rolling them forward or backward
- non-equi join - if your join condition uses non-equality
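A sketch of the first of these, pulling State from df2 into df1 by reference; i.State is data.table's prefix for columns of the i table:

library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# update on join: add State to dt1 in place, NA where there is no match
dt1[dt2, State := i.State, on = "CustomerId"]
dt1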
Code to reproduce:
You can do joins as well using Hadley Wickham's awesome dplyr package.
Mutating joins: add columns to df1 using matches in df2
Filtering joins: filter out rows in df1, don't modify columns
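A sketch of both families on the OP's data, using the standard dplyr verbs:

library(dplyr)

# mutating joins: add columns from df2
inner_join(df1, df2, by = "CustomerId")  # rows with keys in both
left_join(df1, df2, by = "CustomerId")   # all rows of df1
right_join(df1, df2, by = "CustomerId")  # all rows of df2
full_join(df1, df2, by = "CustomerId")   # all rows of both

# filtering joins: keep or drop rows of df1, columns unchanged
semi_join(df1, df2, by = "CustomerId")   # rows of df1 with a match in df2
anti_join(df1, df2, by = "CustomerId")   # rows of df1 without a match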