可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
What is the most efficient way to update/replace NAs in main dataset with (correct) values in a lookup table? This is such a common operation! Similar questions do not seem to have tidy solutions.
Constraints:
1) Please assume a large number of missing values and bigger lookup table than the example given. So case-wise replacement operations would be impractical (no case_when
, if_else
, etc.)
2)The lookup table does not have all values of main dataframe, only the replacement ones.
Tidyverse solution answer much preferred. Similar questions do not seem to have tidy solutions.
library(tidyverse)
### Main Dataframe ###
df1 <- tibble(
state_abbrev = state.abb[1:10],
state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
value = sample(500:1200, 10, replace=TRUE)
)
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 551
#> 2 AK Alaska 765
#> 3 AZ Arizona 508
#> 4 AR Arkansas 756
#> 5 CA California 741
#> 6 CO <NA> 1100
#> 7 CT <NA> 719
#> 8 DE <NA> 874
#> 9 FL Florida 749
#> 10 GA Georgia 580
### Lookup Dataframe ###
lookup_df <- tibble(
state_abbrev = state.abb[6:8],
state_name = state.name[6:8]
)
#> # A tibble: 3 x 2
#> state_abbrev state_name
#> <chr> <chr>
#> 1 CO Colorado
#> 2 CT Connecticut
#> 3 DE Delaware
Ideally, a left_join would have a replacement option for missing values. Alas...
left_join(df1, lookup_df)
#> Joining, by = c("state_abbrev", "state_name")
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 551
#> 2 AK Alaska 765
#> 3 AZ Arizona 508
#> 4 AR Arkansas 756
#> 5 CA California 741
#> 6 CO <NA> 1100
#> 7 CT <NA> 719
#> 8 DE <NA> 874
#> 9 FL Florida 749
#> 10 GA Georgia 580
```
Created on 2018-07-28 by the reprex package (v0.2.0).
回答1:
Picking up Alistaire's and Nettle's suggestions and transforming into a working solution
df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
# A tibble: 10 x 3
state_abbrev value state_name
<chr> <int> <chr>
1 AL 671 Alabama
2 AK 501 Alaska
3 AZ 1030 Arizona
4 AR 694 Arkansas
5 CA 881 California
6 CO 821 Colorado
7 CT 742 Connecticut
8 DE 665 Delaware
9 FL 948 Florida
10 GA 790 Georgia
The OP has stated to prefer a "tidyverse" solution. However, update joins are already available with the data.table
package:
library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
state_abbrev state_name value
1: AL Alabama 1103
2: AK Alaska 1036
3: AZ Arizona 811
4: AR Arkansas 604
5: CA California 868
6: CO Colorado 1129
7: CT Connecticut 819
8: DE Delaware 1194
9: FL Florida 888
10: GA Georgia 501
Benchmark
library(bench)
bm <- press(
na_share = c(0.1, 0.5, 0.9),
n_row = length(state.abb) * 2 * c(1, 100, 10000),
{
n_na <- na_share * length(state.abb)
set.seed(1)
na_idx <- sample(length(state.abb), n_na)
tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
lookup_df <-tmp[na_idx]
tmp[na_idx, state_name := NA]
df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
mark(
dplyr = {
df1 <- copy(df0)
df1 <- df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
df1
},
upd_join = {
df1 <- copy(df0)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
}
)
}
)
ggplot2::autoplot(bm)
data.table
's upate join is always faster (note the log time scale).
As the update join modifies the data object, a fresh copy is used for each benchmark run.
回答2:
There's currently no one-shot for trying to coalesce more than one column (which can be done by using a lookup table approach within ifelse(is.na(value), ..., value)
), though there has been discussion of how such behavior may be implemented. For now, you can build it manually. If you've got a lot of columns, you can coalesce
programmatically, or even put it in a function.
library(tidyverse)
df1 <- tibble(
state_abbrev = state.abb[1:10],
state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
value = sample(500:1200, 10, replace=TRUE)
)
lookup_df <- tibble(
state_abbrev = state.abb[6:8],
state_name = state.name[6:8]
)
df1 %>%
full_join(lookup_df, by = 'state_abbrev') %>%
bind_cols(map_dfc(grep('.x', names(.), value = TRUE), function(x){
set_names(
list(coalesce(.[[x]], .[[gsub('.x', '.y', x)]])),
gsub('.x', '', x)
)
})) %>%
select(union(names(df1), names(lookup_df)))
#> # A tibble: 10 x 3
#> state_abbrev state_name value
#> <chr> <chr> <int>
#> 1 AL Alabama 877
#> 2 AK Alaska 1048
#> 3 AZ Arizona 973
#> 4 AR Arkansas 860
#> 5 CA California 938
#> 6 CO Colorado 639
#> 7 CT Connecticut 547
#> 8 DE Delaware 672
#> 9 FL Florida 667
#> 10 GA Georgia 1142
回答3:
in order to preserve the column order:
df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name.x = coalesce(state_name.x, state_name.y)) %>%
rename(state_name = state_name.x) %>%
select(-state_name.y)
回答4:
If the abbreviation column is complete and the lookup table is complete, could you just drop the state_name column and then join?
left_join(df1 %>% select(-state_name), lookup_df, by = 'state_abbrev') %>%
select(state_abbrev, state_name, value)
Another option could be to use match
and if_else
in a mutate
call using the built in state name and abbreviation lists:
df1 %>%
mutate(state_name = if_else(is.na(state_name), state.name[match(state_abbrev,state.abb)], state_name))
Both give the same output:
# A tibble: 10 x 3
state_abbrev state_name value
<chr> <chr> <int>
1 AL Alabama 525
2 AK Alaska 719
3 AZ Arizona 1186
4 AR Arkansas 1051
5 CA California 888
6 CO Colorado 615
7 CT Connecticut 578
8 DE Delaware 894
9 FL Florida 536
10 GA Georgia 599