Finding maximum values of a score for a given subj

2019-09-17 11:41发布

问题:

To start, here's example data which I'm working with:

ID BaselineScore MidScore Final Score
1  x             NA       NA 
1  NA            y        NA
1  NA            NA       z 
2  a             NA       NA 
2  NA            b        NA
2  NA            NA       c

What I'd like to accomplish is for a given ID (ID==1,ID==2, etc.), determine which of the three scores (baseline, mid, or final) is greatest (i.e. max(x,y,z), max(a,b,c), etc.). The reason I have NAs is because I used the spread function from tidyr (the score variables at a certain time point were originally rows under a more general score variable).

I tried used the base R pmax function, but that only works if you have 'horizontally' aligned values between columns.

Any tips?

Thanks,

回答1:

Here is a base solution using apply and max and then find the max index.

df <- read.csv(text="ID,BaselineScore,MidScore,Final Score
1,1,NA,NA
1,NA,2,NA
1,NA,NA,3
2,7,NA,NA
2,NA,6,NA
2,NA,NA,5")

fun_base <- function() {
    lapply(split(df, df$ID), function(x) {
        tmp <- apply(x[-1], 2, max, na.rm=TRUE)
        tmp[which.max(tmp)]
    })
}

fun_dplyr <- function() {
    df %>% 
        gather(Score_type, Score, -ID) %>% 
        group_by(ID) %>% 
        filter(Score==max(Score, na.rm=TRUE))
}

microbenchmark(
    fun_base(),
    fun_dplyr(),
    times=50L)

#Unit: microseconds
#        expr    min     lq     mean  median     uq    max neval
#  fun_base()  590.6  666.6  728.842  709.85  789.1 1060.1    50
# fun_dplyr() 2110.3 2318.3 2533.324 2442.75 2639.5 3663.4    50


回答2:

We can coalesce the columns together and then get the max by 'ID'

library(tidyverse)
df %>%
   transmute(ID, newCol = coalesce(BaselineScore, MidScore, FinalScore)) %>% 
   group_by(ID) %>%
   summarise(newCol = max(newCol))
# A tibble: 2 × 2
#      ID newCol
#   <int>  <chr>
#1     1      z
#2     2      c

Or another option is to use pmax and max

df %>% 
 transmute(ID, newCol = pmax(BaselineScore, MidScore, FinalScore, na.rm =TRUE)) %>% 
 group_by(ID) %>% 
 summarise(newCol = max(newCol))


标签: r dplyr tidyr