Modify certain values in a data frame by indirect

2019-07-09 04:50发布

I'm wrangling some data where we sort fails into bins and compute limited yields for each sort bin by lot.

I have a meta table that describes the sort bins. The rows are arranged in ascending test order and some of the sort labels come in with non-syntactic names.

sort_tbl <- tibble::tribble(~weight,   ~label,
                                  0, "fail A",
                                  0, "fail B",
                                  0, "fail C",
                                100,   "pass")
> sort_tbl
# A tibble: 4 x 2
  weight  label
   <dbl>  <chr>
1      0 fail A
2      0 fail B
3      0 fail C
4    100   pass

I have a data table of limited yield by sort bin with one row per lot and one col for each sort bin. Because this table was constructed from a transposition we get instances where a particular sort never occurred for a lot and the resulting value is NA. Note that the columns in this table are arranged in descending test order.

yld_tbl <- tibble::tribble(  ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
                           "lot1",    NA,        NA,      0.00,        NA,
                           "lot2",    NA,      0.00,      0.80,        NA,
                           "lot3",  0.49,        NA,      0.50,      0.98,
                           "lot4",  0.70,      0.95,      0.74,      0.99)
> yld_tbl
# A tibble: 4 x 5
    lot  pass `fail C` `fail B` `fail A`
  <chr> <dbl>    <dbl>    <dbl>    <dbl>
1  lot1    NA       NA     0.00       NA
2  lot2    NA     0.00     0.80       NA
3  lot3  0.49       NA     0.50     0.98
4  lot4  0.70     0.95     0.74     0.99

Some of the missing values imply a limited yield of 100% while others reflect an undefined value because we are zero yield earlier in the flow. My task is to replace the former group of NA's with 1.00 as appropriate.

One algorithm to accomplish this works left to right (descending test order) replacing NA with 1.00 if the subsequent limited yield is not NA. In the first row of the example data set, we don't change fail C since pass is missing. But we do replace fail A with 1.00 since fail B is not missing.

The correct example output would be:

> fill_ones(yld_tbl, sort_tbl)
# A tibble: 4 x 5
    lot  pass `fail C` `fail B` `fail A`
  <chr> <dbl>    <dbl>    <dbl>    <dbl>
1  lot1    NA       NA     0.00     1.00
2  lot2    NA     0.00     0.80     1.00
3  lot3  0.49     1.00     0.50     0.98
4  lot4  0.70     0.95     0.74     0.99

3条回答
爷的心禁止访问
2楼-- · 2019-07-09 05:08

Following OP's approach to fill in missing 1.00 from left to right, this can be implemented using melt(), dcast() and rleid():

library(data.table)
mDT <- melt(setDT(yld_tbl), id.var = "lot")
mDT[
  mDT[, grp := rleid(is.na(value)), by = lot][, .I[is.na(value) & grp > 1]]
  , value := 1][
    , dcast(.SD, lot ~ variable)]
    lot pass fail C fail B fail A
1: lot1   NA     NA   0.00   1.00
2: lot2   NA   0.00   0.80   1.00
3: lot3 0.49   1.00   0.50   0.98
4: lot4 0.70   0.95   0.74   0.99
5: lot5 0.95   0.95   1.00   1.00

Data

yld_tbl <- tibble::tribble(  ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
                             "lot1",    NA,        NA,      0.00,        NA,
                             "lot2",    NA,      0.00,      0.80,        NA,
                             "lot3",  0.49,        NA,      0.50,      0.98,
                             "lot4",  0.70,      0.95,      0.74,      0.99,
                             "lot5",  0.95,      0.95,        NA,        NA)

Note the additional "lot5" row.

查看更多
干净又极端
3楼-- · 2019-07-09 05:16

This problem becomes a bit easier if you think of it as "first replace all the NAs with 1, then replace all 1s after the first 0 with NA."

Here are two approaches, one using matrix operations and one using dplyr.


In the matrix approach, you'd extract the values as a numeric matrix, use apply to find the positions that need to be replaced with NA, and return them.

# extract as a matrix, with left-to-right bins
m <- as.matrix(yld_tbl[, sort_tbl$label])

# replace NAs with 1
m[is.na(m)] <- 1

# find 1s happening after a zero in each row
after_zero <- t(apply(m == 0, 1, cumsum)) & (m == 1)

# replace them with NA
m[after_zero] <- NA

# return them in the table
yld_tbl[, sort_tbl$label] <- m

Using dplyr/tidyr, you'd first gather() the columns (using arrange() to put them in the desired order), replace the NAs (the group_by/mutate is accomplishing the same thing as apply above), and spread them back into a wide format.

library(dplyr)
library(tidyr)

yld_tbl %>%
  gather(label, value, -lot) %>%
  arrange(lot, match(label, sort_tbl$label)) %>%
  replace_na(list(value = 1)) %>%
  group_by(lot) %>%
  mutate(value = ifelse(cumsum(value == 0) > 0 & value == 1, NA, value)) %>%
  spread(label, value)

Note that unlike the matrix-based approach, this does not preserve the ordering of the columns.

查看更多
萌系小妹纸
4楼-- · 2019-07-09 05:29

To generate the output table I have written the following function:

library(rlang)
library(dplyr)

fill_ones <- function(df, meta) {
  fail_labels <- meta[meta$weight == 0, ]$label
  last_val <- NULL
  for ( i in length(fail_labels):1) {
    if (is.null(last_val)) last_val <- df$pass
    else last_val <- eval_tidy(sym(fail_labels[[i+1]]), df)
    this_name <- sym(fail_labels[[i]])
    this_val  <- eval_tidy(this_name, df)
    this_val[intersect(which(!is.na(last_val)), which(is.na(this_val)))] <- 1
    df <- mutate(df, !!!new_definition(this_name, this_val))
  }
  df
}

This function loops over the fail sorts defined in meta and computes changes to the corresponding column in the data table df.

Calls to sym(fail_labels[[i]]) lookup the name of each column and eval_tidy(..., df) extracts the corresponding vector in the data frame.

The expression intersect(which(!is.na(last_val)), which(is.na(this_val))) defines the subset of NA's that will be replaced by a 1.00.

The entire column is over-written with new values using mutate(). To reduce the amount of quoting and unquoting I use new_definition() rather than :=.

I'm not convinced I've reached the simplest syntax for indirectly referring to columns in a data table. Having non-syntactic names doesn't help. Also, we only need to modify a limited number of NA's yet this solution re-writes every data entry column by column. I haven't hit upon a good syntax to avoid this (without turning to data.table).

If anyone has a better approach I would love to hear it.

查看更多
登录 后发表回答