I have an R data frame with hundreds of rows, for example:
word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1
I would like to group the rows by pattern (say seed + seeds, and so on) and sum their frequencies, so that the result looks like:
word Freq
seed 7
contract 4
river 1
Here is potentially another way to go. The SnowballC package has a function, wordStem(), that cleans up words and returns their stems. Using that, you can skip the string manipulation, I think. Once the words are stemmed, all you have to do is sum the word frequencies for each stem.
library(SnowballC)
library(dplyr)
mydf <- read.table(text = "word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1", header = TRUE, stringsAsFactors = FALSE)
mutate(mydf, word = wordStem(word)) %>%
group_by(word) %>%
summarise(total = sum(Freq))
# word total
# (chr) (int)
#1 contract 4
#2 river 1
#3 seed 7
One option would be to create a grouping variable 'gr' by extracting a substring of 'word' whose length is the minimum number of characters in 'word'. Do this once more on 'word' within each 'gr' group, so that we get the common substring for each group of words, and then get the sum of 'Freq' by 'word'.
library(dplyr)
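# df1 is the data frame from the question (columns 'word' and 'Freq')
df1 %>%
  # coarse key: first n characters, where n is the length of the shortest word overall
  group_by(gr = substr(word, 1, min(nchar(word)))) %>%
  # within each 'gr' group, truncate every word to the length of that group's shortest word
  group_by(word = substr(word, 1, min(nchar(word)))) %>%
  summarise(Freq = sum(Freq))
#       word  Freq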
# (chr) (int)
#1 contract 4
#2 river 1
#3 seed 7
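To make the two grouping steps easier to follow, here is a base R sketch of the same idea on the sample data from the question. The names gr, min_len, root, and totals are only illustrative, and this is a rough equivalent rather than the exact dplyr pipeline.

```r
# Sample data from the question
dat <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                  Freq = c(4, 3, 2, 2, 1),
                  stringsAsFactors = FALSE)

# Step 1: coarse key, using the length of the shortest word in the whole column
gr <- substr(dat$word, 1, min(nchar(dat$word)))

# Step 2: within each coarse group, find the length of its shortest word
min_len <- ave(nchar(dat$word), gr, FUN = min)

# Truncate every word to that length, so each group collapses to its shortest member
root <- substr(dat$word, 1, min_len)

# Sum the frequencies per root
totals <- tapply(dat$Freq, root, sum)
totals
```

Note that the coarse key depends on the shortest word in the whole column, so a single very short word elsewhere in the data would make the first-step groups much broader.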
You can also do this with a cross join, which is a little bit safer than the method above.
library(dplyr)
library(stringi)
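# df is the data frame from the question (columns 'word' and 'Freq')
df %>%
  # no shared column names, so merge() pairs every row with every word (cross join)
  merge(df %>% select(short_word = word)) %>%
  # keep the pairs where short_word occurs inside word (regex match)
  filter(short_word %>%
           stri_detect_regex(word, .)) %>%
  group_by(word) %>%
  # for each word, keep only its shortest matching short_word
  slice(short_word %>% stri_length %>% which.min) %>%
  group_by(short_word) %>%
  summarise(Freq = sum(Freq))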
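If you would rather avoid the extra packages, the same "shortest matching word" idea can be sketched in base R. This is only a rough equivalent under two assumptions: shortest_prefix is a hypothetical helper name, and it matches prefixes with startsWith() instead of a regex anywhere in the word, which is a stricter rule than the pipeline above.

```r
# Sample data from the question
dat <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                  Freq = c(4, 3, 2, 2, 1),
                  stringsAsFactors = FALSE)

# Hypothetical helper: for each word, return the shortest word in the vector
# that is a prefix of it (every word is a prefix of itself, so a match always exists)
shortest_prefix <- function(words) {
  vapply(words, function(w) {
    candidates <- words[startsWith(w, words)]
    candidates[which.min(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

dat$root <- shortest_prefix(dat$word)

# Sum the frequencies per root word
result <- aggregate(Freq ~ root, data = dat, FUN = sum)
result
```

Because each word is compared with every other word, this is quadratic in the number of rows, which should be fine for a few hundred words.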
An attempt using adist() to match the terms up.
dat$grp <- seq(nrow(dat))
# generate a matrix comparing the vector of words to themselves
tmp <- adist(dat$word, dat$word, partial=TRUE)
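# ignore each word's match with itself
diag(tmp) <- Inf
# where a word (row) is found inside another word (column), give the column word the row word's group
dat$grp[col(tmp)[tmp==0]] <- row(tmp)[tmp==0]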
final <- aggregate(Freq ~ grp, data=dat, sum)
final$word <- dat$word[match(final$grp, dat$grp)]
# grp Freq word
#1 1 7 seed
#2 3 4 contract
#3 5 1 river
Data used:
dat <- data.frame(word=c("seed","seeds","contract","contracting","river"),Freq=c(4,3,2,2,1))