I'm wrangling some data where we sort fails into bins and compute limited yields for each sort bin by lot.
I have a meta table that describes the sort bins. The rows are arranged in ascending test order and some of the sort labels come in with non-syntactic names.
sort_tbl <- tibble::tribble(~weight, ~label,
0, "fail A",
0, "fail B",
0, "fail C",
100, "pass")
> sort_tbl
# A tibble: 4 x 2
weight label
<dbl> <chr>
1 0 fail A
2 0 fail B
3 0 fail C
4 100 pass
I have a data table of limited yield by sort bin with one row per lot and one col for each sort bin. Because this table was constructed from a transposition we get instances where a particular sort never occurred for a lot and the resulting value is NA
. Note that the columns in this table are arranged in descending test order.
yld_tbl <- tibble::tribble( ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
"lot1", NA, NA, 0.00, NA,
"lot2", NA, 0.00, 0.80, NA,
"lot3", 0.49, NA, 0.50, 0.98,
"lot4", 0.70, 0.95, 0.74, 0.99)
> yld_tbl
# A tibble: 4 x 5
lot pass `fail C` `fail B` `fail A`
<chr> <dbl> <dbl> <dbl> <dbl>
1 lot1 NA NA 0.00 NA
2 lot2 NA 0.00 0.80 NA
3 lot3 0.49 NA 0.50 0.98
4 lot4 0.70 0.95 0.74 0.99
Some of the missing values imply a limited yield of 100% while others reflect an undefined value because we are zero yield earlier in the flow. My task is to replace the former group of NA
's with 1.00
as appropriate.
One algorithm to accomplish this works left to right (descending test order) replacing NA
with 1.00
if the subsequent limited yield is not NA
. In the first row of the example data set, we don't change fail C
since pass
is missing. But we do replace fail A
with 1.00
since fail B
is not missing.
The correct example output would be:
> fill_ones(yld_tbl, sort_tbl)
# A tibble: 4 x 5
lot pass `fail C` `fail B` `fail A`
<chr> <dbl> <dbl> <dbl> <dbl>
1 lot1 NA NA 0.00 1.00
2 lot2 NA 0.00 0.80 1.00
3 lot3 0.49 1.00 0.50 0.98
4 lot4 0.70 0.95 0.74 0.99
Following OP's approach to fill in missing 1.00 from left to right, this can be implemented using
melt()
,dcast()
andrleid()
:Data
Note the additional
"lot5"
row.This problem becomes a bit easier if you think of it as "first replace all the NAs with 1, then replace all 1s after the first 0 with NA."
Here are two approaches, one using matrix operations and one using dplyr.
In the matrix approach, you'd extract the values as a numeric matrix, use
apply
to find the positions that need to be replaced with NA, and return them.Using dplyr/tidyr, you'd first
gather()
the columns (usingarrange()
to put them in the desired order), replace the NAs (thegroup_by
/mutate
is accomplishing the same thing asapply
above), andspread
them back into a wide format.Note that unlike the matrix-based approach, this does not preserve the ordering of the columns.
To generate the output table I have written the following function:
This function loops over the fail sorts defined in
meta
and computes changes to the corresponding column in the data tabledf
.Calls to
sym(fail_labels[[i]])
lookup the name of each column andeval_tidy(..., df)
extracts the corresponding vector in the data frame.The expression
intersect(which(!is.na(last_val)), which(is.na(this_val)))
defines the subset ofNA
's that will be replaced by a1.00
.The entire column is over-written with new values using
mutate()
. To reduce the amount of quoting and unquoting I usenew_definition()
rather than:=
.I'm not convinced I've reached the simplest syntax for indirectly referring to columns in a data table. Having non-syntactic names doesn't help. Also, we only need to modify a limited number of
NA
's yet this solution re-writes every data entry column by column. I haven't hit upon a good syntax to avoid this (without turning todata.table
).If anyone has a better approach I would love to hear it.