Merging data.frames causes match.names error

Posted 2020-07-27 05:04

Question:

I need to merge many data.frames. Below is sample code that reproduces an error. It looks like a bug.

This code works well:

df1 <- data.frame(v=1:10, v2=rev(1:10))
df2 <- data.frame(vv=1:8, v2=rev(5:12))
df12 <- merge(x=df1, y=df2, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))
df3 <- data.frame(w=2:6, v2=3:7)
df123 <- merge(x=df12, y=df3, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))
df4 <- data.frame(x=1:6, v2=1:6)
df1234 <- merge(x=df123, y=df4, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))

This code fails on the last line with "Error in match.names(clabs, names(xi)) : names do not match previous names". The only change is that df4 now has more rows than df123 (nrow(df4) > nrow(df123)):

df1 <- data.frame(v=1:10, v2=rev(1:10))
df2 <- data.frame(vv=1:8, v2=rev(5:12))
df12 <- merge(x=df1, y=df2, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))
df3 <- data.frame(w=2:6, v2=3:7)
df123 <- merge(x=df12, y=df3, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))
df4 <- data.frame(x=1:16, v2=1:16)
df1234 <- merge(x=df123, y=df4, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))

Let's look at the column names of df123:

names(df123)
[1] "v"    "v2.x" "v2.y" "v2" 

Then change the last name to an arbitrary one:

names(df123)[4] <- "v3"

Now this line of code works correctly:

df1234 <- merge(x=df123, y=df4, by.x=1, by.y=1, all=TRUE, suffixes=c(".x", ".y"))

Is this a bug? I used R 2.13.1 on Win7. If you need any other information, I'll add it to the question.

Answer 1:

This is definitely a bug. I tested it in R 2.14.1 on Windows 7, but I doubt the operating system matters. I recreated a smaller test case of the bug here:

# Create data.
df1=data.frame(rbind(c(1,10,12,NA)))
df2=data.frame(rbind(c(11,11)))

# Works fine.
merge(df1,df2,by=1,all=TRUE)

#   X1 X2.x X3 X4 X2.y
# 1  1   10 12 NA   NA
# 2 11   NA NA NA   11

# Change the names of the columns.
names(df1)= c('v','v2.x','v2.y','v2')
names(df2)= c('x','v2')

# Same data fails!
merge(df1,df2,by=1,all=TRUE)

# Error in match.names(clabs, names(xi)) : 
#   names do not match previous names
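
The only thing that changed between the two calls is the column names. As a rough illustration (a simplified sketch of the suffix step, not the actual merge.data.frame source; the variable names nm_x, nm_y and shared are made up for this example), appending ".x" to the shared name "v2" collides with the "v2.x" that is already there:

```r
# Non-key column names of the two inputs from the failing example.
nm_x <- c("v2.x", "v2.y", "v2")
nm_y <- "v2"

# Assumed simplification: names present on both sides get the ".x" suffix.
shared <- nm_x %in% nm_y
nm_x[shared] <- paste0(nm_x[shared], ".x")

nm_x                  # "v2.x" "v2.y" "v2.x"
anyDuplicated(nm_x)   # 3: the third name duplicates the first
```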

The error occurs in the "merge.data.frame" method, on this line:

x <- rbind(x, ya)

The problem is that "x" and "ya" don't share the same column names, so rbind refuses to stack them. The mismatch is introduced two lines earlier, on this line:

ya <- cbind(ya, x[rep.int(NA_integer_, nyy), nm.x, drop = FALSE])

"nm.x" is the vector of names c("v2.x","v2.y","v2.x"), and x is a data.frame with two columns named "v2.x". Interestingly, when you select those columns from the data.frame, it appears to rename one of them!

names(x)
[1] "v"    "v2.x" "v2.y" "v2.x"
nm.x
[1] "v2.x" "v2.y" "v2.x"
x[,nm.x]
  v2.x v2.y v2.x.1
1   10   12     10
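
The renaming seems to come from data.frame subsetting itself rather than from merge. Here is a small self-contained check, assuming that `[` on a data.frame makes duplicated result names unique the same way make.unique() does. The x below is rebuilt by hand with check.names = FALSE; it is not the internal object from merge.

```r
# A data.frame with a duplicated column name; check.names = FALSE keeps it as is.
x <- data.frame(v = 1, v2.x = 10, v2.y = 12, v2.x = NA, check.names = FALSE)
nm.x <- c("v2.x", "v2.y", "v2.x")

# Selecting by name: "v2.x" matches the first such column both times,
# and the duplicated result name gets a ".1" appended.
by_name <- x[, nm.x]
names(by_name)        # "v2.x" "v2.y" "v2.x.1"

# The same renaming, applied directly to the character vector.
make.unique(nm.x)     # "v2.x" "v2.y" "v2.x.1"
```

Either way, the selected columns no longer carry the names that "x" itself has, which is why the later rbind complains.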

I tried to work around this by selecting the columns by position instead of by name. The values are now what you want, but the resulting name is still changed!

x[,c(1,2,3,4)]
  v v2.x v2.y v2.x.1
1 1   10   12   NA

I have posted this as a bug.
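
Until that is resolved, one possible workaround is to make sure merge never has to add suffixes at all. This is a minimal sketch, assuming the key is always the first column as in the question; tag_columns is a hypothetical helper written for this example, not part of base R:

```r
# Hypothetical helper: append a tag to every non-key column name so that
# no two inputs ever share a non-key name.
tag_columns <- function(df, tag, key = 1) {
  names(df)[-key] <- paste(names(df)[-key], tag, sep = ".")
  df
}

# The data from the failing example in the question.
df1 <- data.frame(v = 1:10, v2 = rev(1:10))
df2 <- data.frame(vv = 1:8, v2 = rev(5:12))
df3 <- data.frame(w = 2:6, v2 = 3:7)
df4 <- data.frame(x = 1:16, v2 = 1:16)

dfs <- list(df1, df2, df3, df4)
tagged <- Map(tag_columns, dfs, paste0("df", seq_along(dfs)))

# Merge everything pairwise on the first column, keeping all rows.
merged <- Reduce(function(a, b) merge(a, b, by.x = 1, by.y = 1, all = TRUE),
                 tagged)
names(merged)   # "v" "v2.df1" "v2.df2" "v2.df3" "v2.df4"
```

Because every non-key name is already unique, the suffixes argument never comes into play, which avoids the name collision described above.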