R Dataframe: aggregating strings within column, ac

Posted 2019-01-20 01:59

Question:

I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of rows are known to belong together based on unique combinations of other variables in the dataframe. See, for example, an MWE demonstrating the structure and my initial solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy",header=T,stringsAsFactors=F)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

In this example, as in my real case, "passage" is the unique grouping variable, so it is not strictly necessary to carry the other columns along with it, although I'd like them available in my dataset.

My current estimate is that the procedure I have devised will take several hours for a dataset that is otherwise easily handled by R on my computer. Perhaps there are some efficiencies to be gained, either from other functions or packages, or from not creating and removing so many objects?

Thanks for any help here!

Answer 1:

Here are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)

dplyr

library(dplyr)
df %>% 
    group_by(page, passage, person) %>%
    summarize(text = paste(text, collapse = ' '))


Answer 2:

data.table

Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

Putting the extra variables (besides passage) in the by doesn't cost much, I think.
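
If you would rather group by passage alone, a minimal sketch along these lines might work. It assumes page and person are constant within each passage (true in the example data, but worth checking in the real data), and the name res is just a placeholder for illustration:

```r
library(data.table)

# Example data from the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

DT <- as.data.table(df)

# Group by passage only; keep the first page/person seen in each group.
# ASSUMPTION: page and person do not vary within a passage.
res <- DT[, .(page   = page[1],
              person = person[1],
              concat = paste(text, collapse = " ")),
          by = passage]
res
```

Whether this is faster than listing all three columns in by probably depends on the data, so it is worth timing both on a realistic subset.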


dplyr

The analogue is

df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

base R

One way is:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy
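
One caveat: all of these approaches paste the text in whatever row order the data frame happens to have. If the rows are not guaranteed to be sorted by index within each passage, it may be safer to sort first. A rough base R sketch, where the shuffled and sorted data frames are made up purely for illustration:

```r
# Example data from the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical: scramble the rows to mimic data that arrives out of order
shuffled <- df[c(3, 1, 5, 2, 4, 8, 6, 7), ]

# Sort by passage, then by index within passage, before collapsing
sorted <- shuffled[order(shuffled$passage, shuffled$index), ]

result <- aggregate(
    text ~ page + passage + person,
    data = sorted,
    FUN = paste, collapse = " "
)
result
```

The same order() step can be placed in front of the data.table or dplyr versions if the real data is not already sorted.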