Group data frame by pattern in R

Posted 2019-01-25 23:10

Question:

I have an R data frame with hundreds of rows, like this:

word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1

I would like to group the data by pattern (for example, seed + seeds) so that the result looks like this:

word     Freq
seed      7
contract  4
river     1

Answer 1:

Here is potentially another way to go. The SnowballC package has a function, wordStem(), that reduces words to their stems. With it you can skip the manual string manipulation, I think. Once the words are stemmed, all that is left is to sum the word frequencies per stem.

library(SnowballC)
library(dplyr)

mydf <- read.table(text = "word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1", header = TRUE, stringsAsFactors = FALSE)

# replace each word by its stem, then total the frequencies per stem
mutate(mydf, word = wordStem(word)) %>%
    group_by(word) %>%
    summarise(total = sum(Freq))

#      word total
#     (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7


Answer 2:

One option is to create a grouping variable 'gr' by extracting a substring whose length is the minimum number of characters in 'word', repeat the extraction on 'word' within each group so that every group of words is reduced to one shared substring, and then sum 'Freq' by 'word'.

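library(dplyr)

# df1 is the data frame from the question; 'word' is assumed to be a character column
df1 %>%
    # coarse groups: a prefix as long as the shortest word overall
    group_by(gr = substr(word, 1, min(nchar(word)))) %>%
    # within each 'gr' group, truncate every word to that group's shortest word
    mutate(word = substr(word, 1, min(nchar(word)))) %>%
    group_by(word) %>%
    summarise(Freq = sum(Freq))
#       word  Freq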
#      (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
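
A base R sketch of the same two-step idea may make the grouping easier to follow. It assumes the five rows from the question, and the names dat, gr, len, key and totals are chosen here purely for illustration; it is not the answer's own code.

```r
# Sample data from the question (assumed to be representative)
dat <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                  Freq = c(4, 3, 2, 2, 1),
                  stringsAsFactors = FALSE)

# Step 1: coarse key = prefix whose length is that of the shortest word overall
n  <- min(nchar(dat$word))
gr <- substr(dat$word, 1, n)

# Step 2: within each coarse group, find the length of that group's shortest word
len <- ave(nchar(dat$word), gr, FUN = min)

# Truncate every word to its group's shortest length to get the final key
key <- substr(dat$word, 1, len)

# Sum the frequencies per key (named vector, one entry per key)
totals <- tapply(dat$Freq, key, sum)
print(totals)
```

Because step 1 keys on a fixed-length prefix, unrelated words that happen to share their first few letters would fall into the same group, so this is probably only suitable for data that looks like the sample.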


Answer 3:

This can also be done with a cross join, which is a little safer than the method above.

library(dplyr)
library(stringi)

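# df is the data frame from the question
df %>%
  # no shared columns, so merge() pairs every word with every candidate short_word
  merge(df %>% select(short_word = word)) %>%
  # keep the pairs where short_word (used as a regex) is found inside word
  filter(short_word %>%
           stri_detect_regex(word, .)) %>%
  group_by(word) %>%
  # for each word, keep only its shortest match
  slice(short_word %>% stri_length %>% which.min) %>%
  group_by(short_word) %>%
  summarise(Freq = sum(Freq))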
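
The idea behind the cross join (map each word to the shortest listed word it contains, then sum per mapped word) can also be sketched in base R. This is an illustration under assumptions, not the answer's code: the helper shortest_contained() and the names dat, key and totals are hypothetical, and it matches literal substrings (fixed = TRUE) rather than regular expressions.

```r
# Sample data from the question (assumed to be representative)
dat <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                  Freq = c(4, 3, 2, 2, 1),
                  stringsAsFactors = FALSE)

# Hypothetical helper: the shortest candidate that occurs literally inside w.
# Every word contains itself, so there is always at least one hit.
shortest_contained <- function(w, candidates) {
  is_inside <- vapply(candidates,
                      function(s) grepl(s, w, fixed = TRUE),
                      logical(1))
  hits <- candidates[is_inside]
  hits[which.min(nchar(hits))]
}

# Map every word to its shortest contained word
key <- vapply(dat$word, shortest_contained, character(1),
              candidates = dat$word, USE.NAMES = FALSE)

# Sum the frequencies per mapped word
totals <- tapply(dat$Freq, key, sum)
print(totals)
```

Literal matching avoids surprises if a word contains regex metacharacters, at the cost of not supporting patterns.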


Answer 4:

An attempt using adist() to match the terms up.

dat$grp <- seq(nrow(dat))

# generate a matrix comparing the vector of words to themselves
tmp <- adist(dat$word, dat$word, partial=TRUE)
diag(tmp) <- Inf
dat$grp[col(tmp)[tmp==0]] <- row(tmp)[tmp==0]

final <- aggregate(Freq ~ grp, data=dat, sum)
final$word <- dat$word[match(final$grp, dat$grp)]

#  grp Freq     word
#1   1    7     seed
#2   3    4 contract
#3   5    1    river

Data used:

dat <- data.frame(word=c("seed","seeds","contract","contracting","river"),Freq=c(4,3,2,2,1))
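
To see which pairs the adist() step treats as a match, the zero entries of the distance matrix can be listed directly. The sketch below assumes the same sample data; hits and pairs are names introduced here for illustration only.

```r
# Same sample data as above
dat <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                  Freq = c(4, 3, 2, 2, 1),
                  stringsAsFactors = FALSE)

# With partial = TRUE, entry [i, j] is 0 when word i occurs inside word j
tmp <- adist(dat$word, dat$word, partial = TRUE)
diag(tmp) <- Inf  # ignore each word matching itself

# Row = the contained (shorter) word, column = the word that contains it
hits  <- which(tmp == 0, arr.ind = TRUE)
pairs <- data.frame(short = dat$word[hits[, "row"]],
                    long  = dat$word[hits[, "col"]],
                    stringsAsFactors = FALSE)
print(pairs)
```

If a long word contained several shorter listed words, it would appear in more than one row here, and the group it ends up in would depend on the order of assignment, so it may be worth checking this table on real data first.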