Sample a single row, per column, within a subset o

Posted 2019-08-21 03:27

Question:

As an example of my data, I have GROUP 1 with three rows of data, and GROUP 2 with two rows of data, in a data frame:

GROUP   VARIABLE 1   VARIABLE 2   VARIABLE 3 
    1            2            6            5 
    1            4           NA            1 
    1           NA            3            8
    2            1           NA            2      
    2            9           NA           NA 

I would like to sample a single value per column from GROUP 1 to make a new row representing GROUP 1. I do not want to sample one complete row from GROUP 1; the sampling needs to occur independently for each column. I would like to do the same for GROUP 2. Also, the sampling should not consider NAs, unless every row for that group's variable is NA (such as GROUP 2, VARIABLE 2, above).

For example, after sampling, I could have as a result:

GROUP   VARIABLE 1   VARIABLE 2   VARIABLE 3 
    1            4            6            1 
    2            9           NA            2 

Only GROUP 2, VARIABLE 2, can result in NA here. I actually have 39 groups, 50,000+ variables, and a substantial number of NAs. I would sincerely appreciate code that builds a new data frame with one row per group, each row holding that group's sampling results.

Answer 1:

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'GROUP', and loop over the columns with lapply(.SD, ...). If all elements of a column are NA within a group, we return NA; otherwise we sample one of the non-NA elements.

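library(data.table)
setDT(df1)[, lapply(.SD, function(x) {
  vals <- x[!is.na(x)]
  # Sample an index: sample(vals, 1) would draw from 1:vals when length(vals) == 1.
  if (length(vals) == 0L) x[1L] else vals[sample.int(length(vals), 1L)]
}), by = GROUP]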
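
To try the per-group, per-column sampling on the question's example, here is a minimal self-contained sketch. The column names V1 to V3 are stand-ins for VARIABLE 1 to VARIABLE 3, and the helper name sample_non_na is hypothetical (it is not part of data.table). It samples an index with sample.int() so that a group with a single non-NA value returns that value.

```r
library(data.table)

# Example data from the question; V1..V3 are stand-ins for VARIABLE 1..3.
df1 <- data.frame(
  GROUP = c(1L, 1L, 1L, 2L, 2L),
  V1    = c(2L, 4L, NA, 1L, 9L),
  V2    = c(6L, NA, 3L, NA, NA),
  V3    = c(5L, 1L, 8L, 2L, NA)
)

# Hypothetical helper: one random non-NA value, or an NA of x's type if none exist.
sample_non_na <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0L) return(x[NA_integer_])
  vals[sample.int(length(vals), 1L)]
}

set.seed(1)  # only so the draw is repeatable
res <- as.data.table(df1)[, lapply(.SD, sample_non_na), by = GROUP]
print(res)
```

The result has one row per GROUP. Only V2 for GROUP 2 can be NA, and V3 for GROUP 2 is always 2 because that is its only non-NA value.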


Answer 2:

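To ignore NAs, pass one more argument, na.rm = TRUE, to the function used inside summarise; it will drop all the NAs. Note that max() and min() return the largest or smallest value in each group rather than a random draw, and for a column that is entirely NA within a group they return -Inf or Inf with a warning.

I used dplyr to perform the requested grouping, but you can use base functions as well. dplyr is easy to use and read.

If the summary function is the same for all columns, you can use summarise_each (deprecated in newer dplyr releases in favor of across()) and handle every column in one go.

The code is below: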

library(dplyr)

# Summarise each column per group, ignoring NAs.
df = df %>%
  group_by(Group) %>%
  summarise(Var_1 = max(Var_1, na.rm = TRUE),
            Var_2 = max(Var_2, na.rm = TRUE),
            Var_3 = min(Var_3, na.rm = TRUE))
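
If a random draw per column is wanted instead of max() or min(), the same dplyr pipeline can apply one function to every column. This is only a sketch, assuming dplyr 1.0 or later (for across()). The helper name pick_one is hypothetical, and the data frame is rebuilt from the question's example using this answer's column names.

```r
library(dplyr)

# Example data from the question, using this answer's column names.
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L),
  Var_1 = c(2L, 4L, NA, 1L, 9L),
  Var_2 = c(6L, NA, 3L, NA, NA),
  Var_3 = c(5L, 1L, 8L, 2L, NA)
)

# Hypothetical helper: one random non-NA value, or an NA of x's type if none exist.
pick_one <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0L) return(x[NA_integer_])
  vals[sample.int(length(vals), 1L)]
}

sampled <- df %>%
  group_by(Group) %>%
  # everything() covers all non-grouping columns, so no column is named by hand.
  summarise(across(everything(), pick_one), .groups = "drop")
print(sampled)
```

Because the columns are selected with everything(), the call does not change as the number of variables grows, which may matter with 50,000+ columns.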