When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. As in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, the meaning changes completely. I want to know how to match against a pre-defined dictionary whose values consist of whitespace-separated terms, e.g. one containing "semantic distance" and "machine learning". If a document is "we could use machine learning method to calculate the words semantic distance", then applying the dictionary ["semantic distance", "machine learning"] to it should return a 1x2 matrix: [semantic distance, 1; machine learning, 1].
Answer 1:
It's possible to do this with quanteda, although it requires the construction of a dictionary for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the phrases need to be joined by something other than whitespace -- here, the "_" character.
Here are some example texts, including the phrase in the OP. I added two additional texts for the illustration -- below, the first row of the document-feature matrix produces the requested answer.
txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")
The current signature of tokens_compound() requires its phrases argument to be a dictionary or a collocations object. Here we will make it a dictionary:
mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))
Then we pre-process the text to convert the dictionary phrases to their keys:
toks <- tokens(txt) %>%
    tokens_compound(mydict)
toks
# tokens from 3 documents.
# text1 :
# [1] "We" "could" "use" "machine_learning"
# [5] "method" "to" "calculate" "the"
# [9] "words" "semantic_distance" "."
#
# text2 :
# [1] "Machine_learning" "is" "the" "best"
# [5] "sort" "of" "learning" "."
#
# text3 :
# [1] "The" "distance" "between" "semantic_distance"
# [5] "and" "machine_learning" "is" "machine"
# [9] "driven" "."
Finally, we can construct the document-feature matrix, keeping only the phrases by using the default "glob" pattern match to select any feature that contains the underscore character:
mydfm <- dfm(toks, select = "*_*")
mydfm
## Document-feature matrix of: 3 documents, 2 features.
## 3 x 2 sparse Matrix of class "dfm"
## features
## docs machine_learning semantic_distance
## text1 1 1
## text2 1 0
## text3 1 1
(Answer updated for >= v0.9.9)
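If you want exactly the 1x2 matrix shape the question asks for, you can subset the dfm to the first document and convert it to a plain base-R matrix with as.matrix(). A minimal self-contained sketch of the same pipeline; note that in newer quanteda releases the select argument of dfm() was moved to a separate dfm_select() step, so adjust to your version:

```r
library(quanteda)

txt <- "We could use machine learning method to calculate the words semantic distance."
mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))

# Compound the dictionary phrases into single "_"-joined tokens, then
# build the document-feature matrix keeping only the compounded features
toks <- tokens_compound(tokens(txt), mydict)
mydfm <- dfm(toks, select = "*_*")

# Convert the sparse dfm to an ordinary base-R matrix (1 row x 2 columns)
result <- as.matrix(mydfm)
result
```

Each cell of `result` holds the count of the corresponding phrase in the document, so both entries are 1 here.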