R - return boolean if any strings in a vector appe

Posted 2019-09-14 15:05

Question:

I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA

Separately, I have a character vector (risk_codes) of length 136. Each risk code may match the leading characters of a diagnosis code (e.g. for the rows above, J12 would match, F4 would match, H798 would not).

I wish to add a column to the data frame that contains 1 if any of the risk codes matches any of the diagnosis codes in that row, and 0 otherwise. I don't need to know how many match, just that at least one does.

So far, I've tried the following with the most success over other attempts:

for (i in 1:length(risk_codes)) {
    df$newcol <- apply(df,1,function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}

It works well for a single string, populating the column with 0 where no code matches and 1 where one does, but the column is overwritten on each iteration, so only the result for the last of the 136 risk codes survives.

Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.

The solution would look like this

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20   newcol
data   data   data   data   J123    F456    H789       E468      1
data   data   data   data   T452    NA      NA         NA        0

if my risk_codes contained J12, F4, T543, for example.

Answer 1:

We want to apply grepl with all the risk_codes at once, so that each row yields a single result. We can do that with sapply and any.

So, we can drop the for loop and your code becomes like this:

my_df <- read.table(text="Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3  Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA", header=TRUE)

risk_codes <- c("F456", "XXX") # test codes

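# For each row, test every risk code against the diagnosis columns (5 to 24)
# and return TRUE if any code matches any diagnosis.
my_df$newcol <- apply(my_df, 1, function(x)
  any(sapply(risk_codes, function(codes) grepl(codes, x[5:24]))))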

The result is a logical vector.

If you still want 1 and 0 instead of TRUE/FALSE, you just need to finish with:

my_df$newcol <- ifelse(my_df$newcol, 1, 0)

The result will be:

> my_df
  Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data   J123   F456   H789    E468      1
2 data data data data   T452   <NA>   <NA>    <NA>      0
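
For a large data frame, one possible alternative is to avoid the row-wise apply and instead combine all risk codes into a single anchored regular expression, then test each diagnosis column as a whole. The sketch below is not part of the original answer. It assumes that the diagnosis columns can be selected by a "Diag_" name prefix (a hypothetical naming convention taken from the example), that the risk codes contain no regex metacharacters such as ".", and that a match means the diagnosis code starts with the risk code.

```r
# Example data mirroring the question (only four diagnosis columns here).
my_df <- data.frame(
  Col1    = c("data", "data"),
  Diag_1  = c("J123", "T452"),
  Diag_2  = c("F456", NA),
  Diag_3  = c("H789", NA),
  Diag_20 = c("E468", NA),
  stringsAsFactors = FALSE
)

risk_codes <- c("J12", "F4", "T543")  # example codes from the question

# Build one pattern that matches any risk code at the start of a string,
# here "^(J12|F4|T543)". Assumes the codes need no regex escaping.
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Assumption: every diagnosis column name starts with "Diag_".
diag_cols <- grep("^Diag_", names(my_df), value = TRUE)

# Test each diagnosis column in one vectorised call. grepl() returns FALSE
# for NA, so missing diagnoses never count as a match.
col_hits <- lapply(my_df[diag_cols], function(col) grepl(pattern, col))

# Element-wise OR across columns: TRUE if any diagnosis in the row matched.
my_df$newcol <- as.integer(Reduce(`|`, col_hits))
```

With the example codes, the first row is flagged because J123 starts with J12 and F456 starts with F4, while the second row is not, since T452 does not start with T543. If the real columns are identified by position rather than name, my_df[5:24] could be used in place of my_df[diag_cols].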