Count number of times a word-wildcard appears in t

Posted 2019-08-08 03:08

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:

1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1).

2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2).

I'm able to achieve (1), but not (2). Can anyone please help? Thanks.

library(tm)    # Corpus(), VectorSource()
library(qdap)  # apply_as_df(), termco()

text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")

# Use termco to search for the words in the text
apply_as_df(text, termco, match.list=words)

# Result:
#      docs    word.count    activation    activated    activat*
# 1   doc 1             5     1(20.00%)    1(20.00%)           0

Reply #2 · 2019-08-08 03:50

Could this have something to do with the versions? I ran the same code (see below) and got what you expected:

    > text <- "activation has begunm system activated"
    > text <- Corpus(VectorSource(text))
    > words <- c("activation", "activated", "activat")
    > apply_as_df(text, termco, match.list=words)
       docs word.count activation activated   activat
    1 doc 1          5  1(20.00%) 1(20.00%) 2(40.00%)

Below is the output when I run R.Version(). I am running this in RStudio version 0.99.491 on Windows 10.

    > R.Version()
    $platform
    [1] "x86_64-w64-mingw32"

    $arch
    [1] "x86_64"

    $os
    [1] "mingw32"

    $system
    [1] "x86_64, mingw32"

    $status
    [1] ""

    $major
    [1] "3"

    $minor
    [1] "2.3"

    $year
    [1] "2015"

    $month
    [1] "12"

    $day
    [1] "10"

    $`svn rev`
    [1] "69752"

    $language
    [1] "R"

    $version.string
    [1] "R version 3.2.3 (2015-12-10)"

    $nickname
    [1] "Wooden Christmas-Tree"

Hope this helps
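
One thing worth noting about the transcript above: the search term there is "activat" (no asterisk), whereas the question uses "activat*". The two outputs are consistent with the terms being matched as literal substrings, which is an assumption here about termco's behaviour and not something confirmed in this thread. A minimal base R sketch of literal substring counting reproduces both results; `count_fixed` is a hypothetical helper, not part of qdap:

```r
# Hypothetical helper (not a qdap function): count literal, case-insensitive
# substring matches of `term` in `text`.
count_fixed <- function(text, term) {
  hits <- gregexpr(term, tolower(text), fixed = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match
  if (hits[1] == -1L) 0L else length(hits)
}

text <- "activation has begun. system activated"
terms <- c("activation", "activated", "activat", "activat*")

counts <- vapply(terms, function(term) count_fixed(text, term), integer(1))
print(counts)
```

Under this assumption, "activat" is found inside both "activation" and "activated", while "activat*" is searched for with a literal asterisk and so is never found.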

Reply #3 · 2019-08-08 04:04

Maybe consider a different approach using the stringi library?

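library(stringi)

text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")

# Replace each "*" with \p{L} (any letter), then count regex matches per word
counts <- unlist(lapply(words, function(word) {
  newWord <- stri_replace_all_fixed(word, "*", "\\p{L}")
  stri_count_regex(text, newWord)
}))

# Express each count as a share of the total number of words in the text
ratios <- counts / stri_count_words(text)
names(ratios) <- words
ratios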

Result is:

activation  activated   activat* 
0.2         0.2        0.4 

In the code, I convert * into \p{L}, which matches any single letter in a regex pattern. After that, I count the occurrences of the regex in the text.
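
If whole-word matching is wanted, the same idea can be sketched in base R. This is a variation on the approach above, not the same method: it assumes the terms contain only letters and "*", turns "*" into "zero or more letters", and anchors the pattern at word boundaries. `wildcard_to_regex` and `count_wildcard` are hypothetical helper names.

```r
# Hypothetical helpers; assumes terms contain only letters and "*".
wildcard_to_regex <- function(term) {
  # "*" becomes "zero or more letters"; \\b anchors at word boundaries
  paste0("\\b", gsub("*", "[[:alpha:]]*", term, fixed = TRUE), "\\b")
}

count_wildcard <- function(text, term) {
  hits <- gregexpr(wildcard_to_regex(term), text,
                   ignore.case = TRUE, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match
  if (hits[1] == -1L) 0L else length(hits)
}

text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")

counts <- vapply(words, function(w) count_wildcard(text, w), integer(1))
print(counts)
```

Because of the word boundaries, "activat*" would not be counted inside a longer word such as "reactivated", which plain substring matching would count.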
