How to build a Term-Document-Matrix from a set of

Posted 2020-05-28 18:07

I have two sets of data:

a set of tags (single words like php, html, etc)
a set of texts

I now want to build a term-document matrix that counts how many times each tag occurs in each text.

I have looked at the R package tm and its TermDocumentMatrix function, but I do not see a way to specify the tags as input.

Is there a way to do that?

I am open to any tool (R, Python, other), although using R would be great.

Let's set the data as:

TagSet <- data.frame(c("c","java","php","javascript","android"))
colnames(TagSet)[1] <- "tag"

TextSet <- data.frame(c("How to check if a java file is a javascript script java blah","blah blah php"))
colnames(TextSet)[1] <- "text"

Now I'd like to get the TermDocumentMatrix of TextSet restricted to the tags in TagSet.

I tried this:

library(tm)
myCorpus <- Corpus(VectorSource(TextSet$text))
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))


> inspect(tdm)
A term-document matrix (7 terms, 2 documents)

Non-/sparse entries: 8/6
Sparsity           : 43%
Maximal term length: 10 
Weighting          : term frequency (tf)

            Docs
Terms        1 2
  blah       1 2
  check      1 0
  file       1 0
  java       2 0
  javascript 1 0
  php        0 1
  script     1 0

But that counts the words found in the texts themselves, whereas I want to count occurrences of my predefined tags.

Tags: r term-document-matrix

2 answers

姐就是有狂的资本

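#2 · 2020-05-28 18:30

tdm.onlytags <- tdm[rownames(tdm) %in% TagSet$tag, ]

This keeps only the rows for your specified tags, and you can then proceed with your analysis. Note that a tag which never appears as a term in tdm (here c and android) has no row to select, so it will be missing from the result rather than shown with a count of zero.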
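Row subsetting can only return tags that already have a row in tdm. If you also need explicit zero rows for the other tags, here is a minimal base R sketch that tabulates tokens against the tag list directly. It does not use tm, and the names tags, texts, tokens and counts are only illustrative. It also assumes that splitting on non-alphanumeric characters is good enough for your texts (a tag such as "c++" would be reduced to "c").

```r
# Data from the question, as plain character vectors.
tags  <- c("c", "java", "php", "javascript", "android")
texts <- c("How to check if a java file is a javascript script java blah",
           "blah blah php")

# Lowercase each text and split it into word tokens;
# any run of non-alphanumeric characters acts as a separator.
tokens <- strsplit(tolower(texts), "[^[:alnum:]]+")

# Count each tag per document. Using the tags as factor levels
# keeps tags with zero occurrences in the table.
counts <- sapply(tokens, function(tok) table(factor(tok, levels = tags)))
dimnames(counts) <- list(Terms = tags, Docs = seq_along(texts))

counts
```

The result is an ordinary integer matrix with one row per tag and one column per text, so java is counted twice in the first text and php once in the second, while c and android stay at zero.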


别忘想泡老子

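#3 · 2020-05-28 18:35

# docs is your corpus; Dictionary$Var1 holds the tags (converted to character in case it is a factor)
DocumentTermMatrix(docs, control = list(dictionary = as.character(Dictionary$Var1)))

You can predefine the dictionary using your set of tags, so that only those terms are counted.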
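Applied to the data from the question, the dictionary approach might look like the sketch below. The variable names are illustrative, and I am assuming a reasonably recent tm. One thing to watch: tm's default wordLengths is c(3, Inf), which ignores terms shorter than three characters, so the single-letter tag c needs the lower bound relaxed.

```r
library(tm)

# Data from the question, as plain character vectors.
tags  <- c("c", "java", "php", "javascript", "android")
texts <- c("How to check if a java file is a javascript script java blah",
           "blah blah php")

corpus <- VCorpus(VectorSource(texts))

# dictionary restricts the counted terms to the tags;
# wordLengths = c(1, Inf) allows one-character terms such as "c".
tdm <- TermDocumentMatrix(corpus, control = list(
  dictionary  = tags,
  wordLengths = c(1, Inf)
))

m <- as.matrix(tdm)
m
```

Use DocumentTermMatrix with the same control list if you prefer documents as rows. Row order and the handling of tags that never occur may vary between tm versions, so it is worth checking rownames(m) against your tag list.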
