I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:
# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)
# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.
# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))
x y.x y.y y.x y.y
1: a 10 11 12 13
2: b 11 12 13 14
3: c 12 13 14 15
4: d 13 14 15 16
5: e 14 15 16 17
6: f 15 16 17 18
I have a workaround, where, if I change the 2nd column name for DT1:
setnames(DT1, "y", "new_y")
# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?
Here's a way of keeping a counter within
Reduce
, if you want to rename during the merge:Another way of doing this:
The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.
Alternatively you could
setNames
for the columns before and domerge
like thisstack and reshape I don't think this maps exactly to the
merge
function but...I have no sense for if this would extend to having more columns than
y
.merge-assign
(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of
merge
. In the OP's case, withmycols="x"
andx
being the same across allDT*
, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)If it's just those 5 datatables (where
x
is the same for all datatables), you could also use nested joins:Or as @Frank said in the comments:
which gives:
This gives the same result as:
When your
x
columns do not have the same values, a nested join will not give the desired solution:this gives:
While:
gives:
Used data:
In order to make the code with
Reduce
work, I changed the names of they
columns.Using reshaping gives you a lot more flexibility in how you want to name your columns.
Or, this would work