Convert one row to multiple rows per subject in a data frame

Published 2020-08-04 09:38

Question:

I have a dataset with a patient identifier and a text field with a summary of medical findings (1 row per patient). I would like to create a dataset with multiple rows per patients by splitting the text field so that each sentence of the summary falls on a different line. Subsequently, I would like to text parse each line looking for certain keywords and negation terms. An example of the structure of the data frame is (the letters represent the sentences):

ID Summary
1 aaaaa. bb. c
2 d. eee. ff. g. h
3 i. j
4 k

I would like to split the text field at the “.” to convert it to:

ID Summary
1 aaaaa
1 bb
1 c
2 d
2 eee
2 ff
2 g
2 h
3 i
3 j
4 k

R code to create the initial data frame:

ID <- c(1, 2, 3, 4)  
Summary <- c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k")  

df <- data.frame(ID, Summary, stringsAsFactors = FALSE)

The following previous posting provides a nice solution: Breaking up (melting) text data in a column in R?

I used the following code from that posting which works for this sample dataset:

dflong <- by(df, df$ID, FUN = function(x) {  
  sentence = unlist(strsplit(x$Summary, "[.]"))  
  data.frame(ID = x$ID, Summary = sentence)  
  })  
dflong2<- do.call(rbind,dflong)  

However, when I try to apply to my larger dataset (>200,000 rows), I get the error message:
Error in data.frame(ID = x$ID, Summary = sentence) : arguments imply differing number of rows: 1, 0

I reduced the data frame down to test it on a smaller dataset and I still get this error message any time the number of rows is >57.

Is there another approach to take that can handle a larger number of rows? Any advice is appreciated. Thank you.

Answer 1:

Use data.table:

library(data.table)
dt = data.table(df)

dt[, strsplit(Summary, ". ", fixed = TRUE), by = ID]
#    ID    V1
# 1:  1 aaaaa
# 2:  1    bb
# 3:  1     c
# 4:  2     d
# 5:  2   eee
# 6:  2    ff
# 7:  2     g
# 8:  2     h
# 9:  3     i
#10:  3     j
#11:  4     k
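If you would rather keep the original column name than get the default `V1`, a small variant of the same call (my own tweak, not part of the original answer) names the result inside `.()`:

```r
library(data.table)

df <- data.frame(ID = c(1, 2, 3, 4),
                 Summary = c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k"),
                 stringsAsFactors = FALSE)
dt <- data.table(df)

# naming the list element keeps the column name "Summary" instead of "V1"
dt[, .(Summary = unlist(strsplit(Summary, ". ", fixed = TRUE))), by = ID]
```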

There are many ways to address @agstudy's comment about empty Summary, but here's a fun one:

dt[, c(tmp = "", # doesn't matter what you put here, will delete in a sec;
                 # its only purpose is to force one output row per group even
                 # when strsplit() returns character(0), which data.table
                 # will kindly fill with NA's for us
       Summary = strsplit(Summary, ". ", fixed = TRUE)), by = ID][,
       tmp := NULL]
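An alternative way to keep the empty-`Summary` rows (a sketch of my own, not from the answer above) is to convert empty strings to `NA` before splitting; `strsplit()` passes `NA` through as a single `NA` element, so those IDs survive with one `NA` row:

```r
library(data.table)

dt <- data.table(ID = 1:3, Summary = c("aa. bb", "", "c"))

# strsplit() maps NA to a one-element NA result, so empty summaries
# come out as a single NA row instead of disappearing
dt[Summary == "", Summary := NA_character_]
dt[, .(Summary = unlist(strsplit(Summary, ". ", fixed = TRUE))), by = ID]
```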


Answer 2:

You get the error because some rows have an empty Summary column: `strsplit` on an empty string returns a zero-length vector, which cannot be combined with the length-1 `ID` inside `data.frame()`. This should work for you:

dflong <- by(df, df$ID, FUN = function(x) {
  sentence <- unlist(strsplit(x$Summary, "[.]"))
  ## I just added this check to your solution
  if (length(sentence) == 0)
    sentence <- NA
  data.frame(ID = x$ID, Summary = sentence)
})
dflong2 <- do.call(rbind, dflong)

PS: This is slightly different from the data.table solution, which removes rows where Summary equals `''` (0 characters). That said, I would use the data.table solution here since you have more than 200,000 rows.
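For completeness: more recent versions of tidyr (this function was added after the answers above were written) offer `separate_rows()`, which does the split-and-stack in one call. A minimal sketch:

```r
library(tidyr)

df <- data.frame(ID = c(1, 2, 3, 4),
                 Summary = c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k"),
                 stringsAsFactors = FALSE)

# sep is a regular expression here, so the period must be escaped
separate_rows(df, Summary, sep = "\\. ")
```

Note that, unlike the data.table one-liner, this keeps a row with an empty string when Summary is `""` rather than dropping it.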