How to calculate proximity of words to a specific term in a document

Asked 2019-02-19 17:31

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text:

song <- "Far over the misty mountains cold To dungeons deep and caverns old We 
must away ere break of day To seek the pale enchanted gold. The dwarves of 
yore made mighty spells, While hammers fell like ringing bells In places deep, 
where dark things sleep, In hollow halls beneath the fells. For ancient king 
and elvish lord There many a gleaming golden hoard They shaped and wrought, 
and light they caught To hide in gems on hilt of sword. On silver necklaces 
they strung The flowering stars, on crowns they hung The dragon-fire, in 
twisted wire They meshed the light of moon and sun. Far over the misty 
mountains cold To dungeons deep and caverns old We must away, ere break of 
day, To claim our long-forgotten gold. Goblets they carved there for 
themselves And harps of gold; where no man delves There lay they long, and 
many a song Was sung unheard by men or elves. The pines were roaring on the 
height, The winds were moaning in the night. The fire was red, it flaming 
spread; The trees like torches blazed with light. The bells were ringing in 
the dale And men they looked up with faces pale; The dragon’s ire more fierce 
than fire Laid low their towers and houses frail. The mountain smoked beneath 
the moon; The dwarves they heard the tramp of doom. They fled their hall to 
dying fall Beneath his feet, beneath the moon. Far over the misty mountains 
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"

I want to be able to see which words appear within 15 words (I would like this number to be interchangeable) on either side (15 to the left and 15 to the right) of the word "fire" (also interchangeable) every time it appears. I want to see each word and the number of times it appears in this 15-word span for each instance of "fire." So, for example, "fire" is used 3 times. Of those 3 occurrences, the word "light" falls within 15 words on either side twice. I would want a table that shows the word, the number of times it appears within the specified proximity of 15, the maximum distance (which in this case is 12), the minimum distance (which is 7), and the average distance (which is 9.5).

I figured I would need several steps and packages to make this work. My first thought was to use the kwic() function from quanteda, since it lets you choose a "window" around a specific term. A frequency count of terms based on the kwic results is then not that hard (with stopwords removed for the frequency count, but not for the proximity measure). My real problem is finding the maximum, minimum, and average distances from the focus term and then getting the results into a neat table with the terms as rows, sorted in descending order by frequency, and columns for the frequency count, maximum distance, minimum distance, and average distance.

Here is what I have so far:

library(quanteda)
library(tm)

# lowercase the text
mysong <- char_tolower(song)

# tokenize, dropping hyphens, punctuation, numbers, and symbols
toks <- tokens(mysong, remove_hyphens = TRUE, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)

# keywords-in-context: a 15-word window on either side of "fire"
mykwic <- kwic(toks, "fire", window = 15, valuetype = "fixed")
thekwic <- as.character(mykwic)

# clean the kwic text and count term frequencies with tm
thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))

kwicFreq <- termFreq(thekwic)

Any help is much appreciated.

Tags: r tm quanteda
2 Answers
Juvenile、少年°
Answered 2019-02-19 17:43

The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function to count within a window is not kwic() but rather fcm() (feature co-occurrence matrix).

require(quanteda)

# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)

# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features    fire
##   Far          1
##   over         1
##   the          5
##   misty        1
##   mountains    0
##   cold         0

head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##         features
## features fire
##    light    2

Getting the average distance of a word from the target requires a bit of a hack of the weights argument. Below, each co-occurrence is weighted by its position in the window, so summing those weighted counts and dividing by the total frequency within the window gives a weighted mean distance. For your example of "light", for instance:

# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
    fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##         features
## features fire
##    light  9.5

Getting the minimum and maximum position is a bit more complicated. One way to "hack" it is to use the weights to place a binary mask at each position and then convert that to a distance, though it's ungainly, so I'd still recommend the tidy solution unless I think of a more elegant way.
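A rough sketch of that binary-mask idea (a sketch only, assuming fcm() will accept a 0/1 weights vector the length of the window, and reusing toks from above):

# at which distances from "fire" does "light" occur within the 15-word window?
dist_present <- sapply(1:15, function(d) {
  w <- numeric(15)
  w[d] <- 1  # count only co-occurrences exactly d positions away
  as.numeric(fcm(toks, context = "window", window = 15,
                 count = "weighted", weights = w)["light", "fire"]) > 0
})
min(which(dist_present))  # minimum distance, 7 in this example
max(which(dist_present))  # maximum distance, 12 in this example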

ゆ 、 Hurt°
Answered 2019-02-19 18:05

I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.

You can start by tokenizing it into a one-row-per-word data frame, adding a position column, and removing stopwords:

library(tidytext)
library(dplyr)

all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))

You can then find just the word fire, and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. You can then use group_by() and summarize() to get your desired statistics for each word.

library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))

Output in this case:

# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
 1     fire      3                0                0              0.0
 2    light      2               12                7              9.5
 3     moon      2               13                9             11.0
 4    bells      1               14               14             14.0
 5  beneath      1               11               11             11.0
 6   blazed      1               10               10             10.0
 7   crowns      1                5                5              5.0
 8     dale      1               15               15             15.0
 9   dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows

Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word).
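For example, a minimal sketch of that change, reusing all_words and the packages loaded above and using "gold" as a (hypothetical) second focus word:

# same pipeline as above, but with two focus words and a grouped summary
nearby_words2 <- all_words %>%
  filter(word %in% c("fire", "gold")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

words_summarized2 <- nearby_words2 %>%
  group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(focus_term, desc(number))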
