R How do i keep punctuation with TermDocumentMatri

I have a large dataframe where I am identifying patterns in strings and then extracting them. I have provided a small subset to illustrate my task. I am generating my patterns by creating a TermDocumentMatrix with multiple words. I use these patterns with stri_extract and str_replace from stringi and stringr packages to search within the 'punct_prob' dataframe.

My problem is that I need to keep punctuation in tact within the 'punct_prob$description' to maintain the literal meanings within each string. For example, I can't have 2.35 mm becoming 235mm. The TermDocumentMatrix procedure I am using however is removing punctuation (or at least the periods) and thus my pattern seeking functions can't match them.

In short... how do i keep the punctuation when generating the TDM? I have tried including removePunctuation=FALSE within the TermDocumentMatrix control argument but with no success.

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

inspect results - with no punctuation....

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks for any help in advance :)

标签： r tm punctuation term-document-matrix

2条回答

霸刀☆藐视天下

2楼-- · 2019-05-24 11:20

The quanteda package is smart enough to tokenise without treating intra-word punctuation characters as "punctuation". This makes constructing your matrix very easy:

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

require(quanteda)
myDfm <- dfm(txt, ngrams = 6:8, concatenator = " ")
t(myDfm)
#                                        docs
# features                                text1 text2 text3 text4
#   contra angle head for 2.35mm bur          1     0     0     0
#   titanium line mini p.b f.o trip           0     1     0     0
#   line mini p.b f.o trip spray              0     1     0     0
#   titanium line mini p.b f.o trip spray     0     1     0     0
#   titanium line power p.b f.o trip          0     0     1     0
#   line power p.b f.o trip spr               0     0     1     0
#   titanium line power p.b f.o trip spr      0     0     1     0

If you want to preserve "punctuation", it will be tokenised as a separate token when it ends a term:

myDfm2 <- dfm(txt, ngrams = 8, concatenator = " ", removePunct = FALSE)
t(myDfm2)
#                                          docs
# features                                  text1 text2 text3 text4
#   titanium line mini p.b f.o . trip spray     0     1     0     0
#   titanium line power p.b f.o . trip spr      0     0     1     0

Note here that the ngrams argument is completely flexible and can take a vector of ngram sizes, as in the first example where ngrams = 6:8 indicates that it should form 6-, 7-, and 8-grams.

0人赞添加讨论(0) 举报

祖国的老花朵

3楼-- · 2019-05-24 11:32

The issue is not so much the termdocumentmatrix, but the ngram tokenizer based on RWEKA. Rweka removes punctuations when doing the tokenizing.

If you use the nlp tokenizer it keeps the punctuation. See code below.

P.S. I removed one space in your 3rd textline so P. B. is P.B. like it is on line 2.

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

punct_prob_corpus = Corpus(VectorSource(punct_prob$description))




NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}


punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))
inspect(punct_prob_tdm)

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0

0人赞添加讨论(0) 举报

R How do i keep punctuation with TermDocumentMatri

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间