I have a dataset with a patient identifier and a text field containing a summary of medical findings (one row per patient). I would like to create a dataset with multiple rows per patient by splitting the text field so that each sentence of the summary falls on a separate row. I would then like to parse each row for certain keywords and negation terms. An example of the structure of the data frame is below (the letters represent the sentences):
ID Summary
1 aaaaa. bb. c
2 d. eee. ff. g. h
3 i. j
4 k
I would like to split the text field at the “.” to convert it to:
ID Summary
1 aaaaa
1 bb
1 c
2 d
2 eee
2 ff
2 g
2 h
3 i
3 j
4 k
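To be concrete about the delimiter, this is the split I have in mind, checked here with base strsplit on a single summary string (the space left after each "." would need to be trimmed afterwards):

# Split one summary on the literal period; "[.]" escapes the regex metacharacter
strsplit("d. eee. ff. g. h", "[.]")[[1]]
# [1] "d"    " eee" " ff"  " g"   " h"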
R code to create the initial data frame:
# Create the sample data frame (numeric ID, character Summary)
ID <- c(1, 2, 3, 4)
Summary <- c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k")
df <- data.frame(ID = ID, Summary = Summary, stringsAsFactors = FALSE)
The following previous posting provides a nice solution: Breaking up (melting) text data in a column in R?
I used the following code from that posting, which works for this sample dataset:
# Split each patient's Summary into sentences: one data frame per ID, then stack
dflong <- by(df, df$ID, FUN = function(x) {
  sentence <- unlist(strsplit(x$Summary, "[.]"))  # split on the literal "."
  data.frame(ID = x$ID, Summary = sentence)       # ID is recycled across sentences
})
dflong2 <- do.call(rbind, dflong)
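On the sample data, dflong2 matches the desired layout above; the only difference I see is that do.call(rbind, ...) adds composite row names (along the lines of "1.1", "1.2"), which can be dropped:

rownames(dflong2) <- NULL  # discard the row names that rbind builds from the list names
dflong2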
However, when I try to apply it to my larger dataset (>200,000 rows), I get the error message:
Error in data.frame(ID = x$ID, Summary = sentence) : arguments imply differing number of rows: 1, 0
I reduced the data frame to smaller subsets to test, and I still get this error message any time the number of rows is greater than 57.
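I can reproduce the identical error on the four-row sample by adding a patient whose Summary is an empty string, which makes me suspect (though I have not confirmed this) that empty Summary values in the larger file are the trigger: strsplit("") returns character(0), so the data.frame() call inside by() sees one ID and zero sentences.

# Hypothetical extra row; my assumption about what the real data may contain
dfbad <- data.frame(ID = c(df$ID, 5),
                    Summary = c(df$Summary, ""),
                    stringsAsFactors = FALSE)
by(dfbad, dfbad$ID, FUN = function(x) {
  sentence <- unlist(strsplit(x$Summary, "[.]"))  # character(0) for ID 5
  data.frame(ID = x$ID, Summary = sentence)
})
# Error in data.frame(ID = x$ID, Summary = sentence) :
#   arguments imply differing number of rows: 1, 0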
Is there another approach that can handle a larger number of rows? Any advice is appreciated. Thank you.
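For reference, I also sketched a tidyverse version that I think expresses the same reshape. separate_rows() from tidyr plus the trimws()/filter() cleanup are my own guess at a more robust route, and I have not benchmarked it on the full 200,000+ row file:

library(dplyr)
library(tidyr)

dflong_alt <- df %>%
  separate_rows(Summary, sep = "[.]") %>%  # one row per sentence
  mutate(Summary = trimws(Summary)) %>%    # strip the space left after each "."
  filter(Summary != "")                    # drop empty pieces

Would this be an acceptable way to handle the larger table, or is there a better base-R fix?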