可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm wrangling some data where we sort fails into bins and compute limited yields for each sort bin by lot.
I have a meta table that describes the sort bins. The rows are arranged in ascending test order and some of the sort labels come in with non-syntactic names.
sort_tbl <- tibble::tribble(~weight, ~label,
0, "fail A",
0, "fail B",
0, "fail C",
100, "pass")
> sort_tbl
# A tibble: 4 x 2
weight label
<dbl> <chr>
1 0 fail A
2 0 fail B
3 0 fail C
4 100 pass
I have a data table of limited yield by sort bin with one row per lot and one col for each sort bin. Because this table was constructed from a transposition we get instances where a particular sort never occurred for a lot and the resulting value is NA
. Note that the columns in this table are arranged in descending test order.
yld_tbl <- tibble::tribble( ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
"lot1", NA, NA, 0.00, NA,
"lot2", NA, 0.00, 0.80, NA,
"lot3", 0.49, NA, 0.50, 0.98,
"lot4", 0.70, 0.95, 0.74, 0.99)
> yld_tbl
# A tibble: 4 x 5
lot pass `fail C` `fail B` `fail A`
<chr> <dbl> <dbl> <dbl> <dbl>
1 lot1 NA NA 0.00 NA
2 lot2 NA 0.00 0.80 NA
3 lot3 0.49 NA 0.50 0.98
4 lot4 0.70 0.95 0.74 0.99
Some of the missing values imply a limited yield of 100% while others reflect an undefined value because we are zero yield earlier in the flow. My task is to replace the former group of NA
's with 1.00
as appropriate.
One algorithm to accomplish this works left to right (descending test order) replacing NA
with 1.00
if the subsequent limited yield is not NA
. In the first row of the example data set, we don't change fail C
since pass
is missing. But we do replace fail A
with 1.00
since fail B
is not missing.
The correct example output would be:
> fill_ones(yld_tbl, sort_tbl)
# A tibble: 4 x 5
lot pass `fail C` `fail B` `fail A`
<chr> <dbl> <dbl> <dbl> <dbl>
1 lot1 NA NA 0.00 1.00
2 lot2 NA 0.00 0.80 1.00
3 lot3 0.49 1.00 0.50 0.98
4 lot4 0.70 0.95 0.74 0.99
回答1:
This problem becomes a bit easier if you think of it as "first replace all the NAs with 1, then replace all 1s after the first 0 with NA."
Here are two approaches, one using matrix operations and one using dplyr.
In the matrix approach, you'd extract the values as a numeric matrix, use apply
to find the positions that need to be replaced with NA, and return them.
# extract as a matrix, with left-to-right bins
m <- as.matrix(yld_tbl[, sort_tbl$label])
# replace NAs with 1
m[is.na(m)] <- 1
# find 1s happening after a zero in each row
after_zero <- t(apply(m == 0, 1, cumsum)) & (m == 1)
# replace them with NA
m[after_zero] <- NA
# return them in the table
yld_tbl[, sort_tbl$label] <- m
Using dplyr/tidyr, you'd first gather()
the columns (using arrange()
to put them in the desired order), replace the NAs (the group_by
/mutate
is accomplishing the same thing as apply
above), and spread
them back into a wide format.
library(dplyr)
library(tidyr)
yld_tbl %>%
gather(label, value, -lot) %>%
arrange(lot, match(label, sort_tbl$label)) %>%
replace_na(list(value = 1)) %>%
group_by(lot) %>%
mutate(value = ifelse(cumsum(value == 0) > 0 & value == 1, NA, value)) %>%
spread(label, value)
Note that unlike the matrix-based approach, this does not preserve the ordering of the columns.
回答2:
To generate the output table I have written the following function:
library(rlang)
library(dplyr)
fill_ones <- function(df, meta) {
fail_labels <- meta[meta$weight == 0, ]$label
last_val <- NULL
for ( i in length(fail_labels):1) {
if (is.null(last_val)) last_val <- df$pass
else last_val <- eval_tidy(sym(fail_labels[[i+1]]), df)
this_name <- sym(fail_labels[[i]])
this_val <- eval_tidy(this_name, df)
this_val[intersect(which(!is.na(last_val)), which(is.na(this_val)))] <- 1
df <- mutate(df, !!!new_definition(this_name, this_val))
}
df
}
This function loops over the fail sorts defined in meta
and computes changes to the corresponding column in the data table df
.
Calls to sym(fail_labels[[i]])
lookup the name of each column and eval_tidy(..., df)
extracts the corresponding vector in the data frame.
The expression intersect(which(!is.na(last_val)), which(is.na(this_val)))
defines the subset of NA
's that will be replaced by a 1.00
.
The entire column is over-written with new values using mutate()
. To reduce the amount of quoting and unquoting I use new_definition()
rather than :=
.
I'm not convinced I've reached the simplest syntax for indirectly referring to columns in a data table. Having non-syntactic names doesn't help. Also, we only need to modify a limited number of NA
's yet this solution re-writes every data entry column by column. I haven't hit upon a good syntax to avoid this (without turning to data.table
).
If anyone has a better approach I would love to hear it.
回答3:
Following OP's approach to fill in missing 1.00 from left to right, this can be implemented using melt()
, dcast()
and rleid()
:
library(data.table)
mDT <- melt(setDT(yld_tbl), id.var = "lot")
mDT[
mDT[, grp := rleid(is.na(value)), by = lot][, .I[is.na(value) & grp > 1]]
, value := 1][
, dcast(.SD, lot ~ variable)]
lot pass fail C fail B fail A
1: lot1 NA NA 0.00 1.00
2: lot2 NA 0.00 0.80 1.00
3: lot3 0.49 1.00 0.50 0.98
4: lot4 0.70 0.95 0.74 0.99
5: lot5 0.95 0.95 1.00 1.00
Data
yld_tbl <- tibble::tribble( ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
"lot1", NA, NA, 0.00, NA,
"lot2", NA, 0.00, 0.80, NA,
"lot3", 0.49, NA, 0.50, 0.98,
"lot4", 0.70, 0.95, 0.74, 0.99,
"lot5", 0.95, 0.95, NA, NA)
Note the additional "lot5"
row.