To start, here's example data which I'm working with:
ID BaselineScore MidScore Final Score
1 x NA NA
1 NA y NA
1 NA NA z
2 a NA NA
2 NA b NA
2 NA NA c
What I'd like to accomplish is for a given ID (ID==1,ID==2, etc.), determine which of the three scores (baseline, mid, or final) is greatest (i.e. max(x,y,z), max(a,b,c), etc.). The reason I have NAs is because I used the spread
function from tidyr (the score variables at a certain time point were originally rows under a more general score variable).
I tried used the base R pmax function, but that only works if you have 'horizontally' aligned values between columns.
Any tips?
Thanks,
Here is a base solution using apply and max and then find the max index.
df <- read.csv(text="ID,BaselineScore,MidScore,Final Score
1,1,NA,NA
1,NA,2,NA
1,NA,NA,3
2,7,NA,NA
2,NA,6,NA
2,NA,NA,5")
fun_base <- function() {
lapply(split(df, df$ID), function(x) {
tmp <- apply(x[-1], 2, max, na.rm=TRUE)
tmp[which.max(tmp)]
})
}
fun_dplyr <- function() {
df %>%
gather(Score_type, Score, -ID) %>%
group_by(ID) %>%
filter(Score==max(Score, na.rm=TRUE))
}
microbenchmark(
fun_base(),
fun_dplyr(),
times=50L)
#Unit: microseconds
# expr min lq mean median uq max neval
# fun_base() 590.6 666.6 728.842 709.85 789.1 1060.1 50
# fun_dplyr() 2110.3 2318.3 2533.324 2442.75 2639.5 3663.4 50
We can coalesce
the columns together and then get the max
by 'ID'
library(tidyverse)
df %>%
transmute(ID, newCol = coalesce(BaselineScore, MidScore, FinalScore)) %>%
group_by(ID) %>%
summarise(newCol = max(newCol))
# A tibble: 2 × 2
# ID newCol
# <int> <chr>
#1 1 z
#2 2 c
Or another option is to use pmax
and max
df %>%
transmute(ID, newCol = pmax(BaselineScore, MidScore, FinalScore, na.rm =TRUE)) %>%
group_by(ID) %>%
summarise(newCol = max(newCol))