When joining data.frames along a key, and one key has a missing value (NA), my intuition was that rows with an NA key should have no match in the second data.frame. To my surprise, if there are NAs in both data.frames, dplyr matches them as if they were values.
This is extra confusing because this was discussed at length on the issues in the dplyr repository see here and it seems to be solved! If so, I'm not seeing how this is the correct solution ; or perhaps I'm missing something
I'm using dplyr 0.7.4
t1 <- data.frame(a = as.character(c("1", "2", NA, NA, "4", "2")), b = c(1, 2, 3, 3, 4, 5), stringsAsFactors = FALSE)
t2 <- data.frame(a = as.character(c("1", "2", NA)), c = c("b", "n", "i"), stringsAsFactors = FALSE)
library(dplyr)
t1
#> a b
#> 1 1 1
#> 2 2 2
#> 3 <NA> 3
#> 4 <NA> 3
#> 5 4 4
#> 6 2 5
t2
#> a c
#> 1 1 b
#> 2 2 n
#> 3 <NA> i
left_join(t1, t2, by = "a")
#> a b c
#> 1 1 1 b
#> 2 2 2 n
#> 3 <NA> 3 i
#> 4 <NA> 3 i
#> 5 4 4 <NA>
#> 6 2 5 n
When in fact I would have expected the following:
#> a b c
#> 1 1 1 b
#> 2 2 2 n
#> 3 <NA> 3 <NA>
#> 4 <NA> 3 <NA>
#> 5 4 4 <NA>
#> 6 2 5 n