I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA
Separately, I have a vector (risk_codes) of length 136, all strings. These strings are risk codes that can be similar to the truncated diagnosis codes (e.g. J12 would be ok, F4 would be ok, H798 would not).
I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.
So far, I've tried the following with the most success over other attempts:
for (i in 1:length(risk_codes)) {
  df$newcol <- apply(df, 1, function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}
It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.
Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.
The solution would look like this
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20 newcol
data data data data J123 F456 H789 E468 1
data data data data T452 NA NA NA 0
if my risk_codes contained J12, F4, T543, for example.
We want to apply grepl with all the risk_codes at once, so that we get a single result per row. We can do that with sapply and any. So, we can drop the for loop and your code becomes like this:
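A minimal sketch of that approach, under a few assumptions: the toy df below has only three diagnosis columns, and diag_cols is a hypothetical helper holding their positions (it would be 5:24 in the real data). The risk codes are the ones from the question's example.

```r
# Toy data (hypothetical): four data columns followed by diagnosis columns.
df <- data.frame(
  Col1 = c("a", "b"), Col2 = c("a", "b"),
  Col3 = c("a", "b"), Col4 = c("a", "b"),
  Diag_1 = c("J123", "T452"),
  Diag_2 = c("F456", NA),
  Diag_3 = c("H789", NA),
  stringsAsFactors = FALSE
)
risk_codes <- c("J12", "F4", "T543")

# Assumption: positions of the diagnosis columns (5:24 in the real data).
diag_cols <- 5:7

# For each row, test every risk code against every diagnosis code and
# keep a single TRUE/FALSE: is there at least one match?
df$newcol <- apply(df[, diag_cols], 1, function(x) {
  any(sapply(risk_codes, function(code) grepl(code, x)))
})
```

Note that grepl matches the pattern anywhere in the string, and returns FALSE for NA values. If "similar" means the risk code must be the start of the diagnosis code, one option is to anchor the pattern, e.g. grepl(paste0("^", code), x); whether that is the intended meaning is an assumption about the data.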
The result is a logical vector.
If you still want to use 1 and 0 instead of the TRUE/FALSE, you just need to finish with:
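One way to do that conversion, sketched on a standalone logical vector (newcol here is a stand-in for df$newcol, not the real column):

```r
# Stand-in for the logical column produced by the any()/sapply() step.
newcol <- c(TRUE, FALSE)

# as.integer() maps TRUE to 1 and FALSE to 0.
newcol <- as.integer(newcol)
```

In the data frame itself this would be df$newcol <- as.integer(df$newcol).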
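With the example risk codes from the question (J12, F4, T543), the result will match the expected output shown there: newcol is 1 for the first row and 0 for the second.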