Assigning weights to different features in R

Posted 2019-08-13 12:43

Is it possible to assign weights to different features before formulating a DFM in R?

Consider this example in R:

str="apple is better than banana"
mydfm=dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)

DFM mydfm looks like:

docs apple better banana
text1  1      1     1

But I want to assign weights (apple: 5, banana: 3) beforehand, so that DFM mydfm looks like:

docs apple better banana
text1  5      1     3

2 answers
一夜七次
Reply #2 · 2019-08-13 13:02

I don't think so; however, you can easily do it afterwards:

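library(quanteda)
str <- "apple is better than banana"
# named weight vector (from the question): feature name -> multiplier
weights <- c(apple = 5, banana = 3)
mydfm <- dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)
# keep only the weights whose names are actual features of the dfm
idx <- which(names(weights) %in% colnames(mydfm))
# scale the matching columns by post-multiplying with a diagonal weight matrix
mydfm[, names(weights)[idx]] <-  mydfm[, names(weights)[idx]] %*% diag(weights[idx])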
mydfm
# 1 x 3 sparse Matrix of class "dgCMatrix"
#        features
# docs    apple better banana
#   text1     5      1      3
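
To see why post-multiplying by diag(weights) scales each column, here is a minimal base R sketch. It uses an ordinary matrix rather than a dfm, and the matrix m and vector w are made-up stand-ins for illustration only, not quanteda objects.

```r
# Plain base R matrix standing in for a 2-document, 3-feature dfm (illustrative values)
m <- matrix(c(1, 1, 1,
              1, 1, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("text1", "text2"),
                            c("apple", "better", "banana")))

# One multiplier per feature, in the same order as the columns of m
w <- c(apple = 5, better = 1, banana = 3)

# Post-multiplying by a diagonal matrix multiplies column j by w[j]
scaled <- m %*% diag(w)
colnames(scaled) <- colnames(m)  # make sure the feature names are kept
scaled
```

Every count in the apple column is multiplied by 5 and every count in the banana column by 3, while the better column is left unchanged.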
smile是对你的礼貌
Reply #3 · 2019-08-13 13:05

This points to the need to add an option to the weight method for dfm-class, to make this easier and more importantly not to strip the class of dfm from the sparse matrix. The dfm also has a @weights slot in the object that is designed to keep a record of how it was weighted, so this information could/should also be preserved.

@lukeA's solution drops the dfm class twice (not his fault or yours, but mine!!): once in the %*% and again in the <-. The first can be avoided by using the standard element-wise * instead of the matrix multiplication %*%, since I don't think a %*% method has been written for dfm-class (which is why it falls back to the sparseMatrix method). The second cannot currently be avoided if you reassign sub-matrix elements, but it can be avoided if you simply replace one dfm-class object with another.

To make the new dfm-class object in a way that preserves the class, this would work (and here I have made the problem slightly more complex by adding a second document and another feature):

str <- c("apple is better than banana", "banana banana apple much better")
weights <- c(apple = 5, banana = 3, much = 0.5)
mydfm <- dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)

# use name matching for indexing, sorts too, returns NA where no match is found
newweights <- weights[features(mydfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1

# R recycles a vector down the columns of a matrix, so repeat each weight once
# per document to line it up with its own feature column
mydfm * rep(newweights, each = ndoc(mydfm))
## Document-feature matrix of: 2 documents, 4 features.
## 2 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    apple better banana much
##   text1     5      1      3  0.0
##   text2     5      1      6  0.5
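
Because the weighting hinges on how R recycles a vector against a matrix, a small base R sketch may help check the alignment. It uses a plain matrix with the same counts as the two example documents; the names m, w, recycled and by_feature are made up for illustration and are not part of quanteda.

```r
# Plain matrix with the same counts as the two example documents (illustrative stand-in)
m <- matrix(c(1, 1, 1, 0,
              1, 1, 2, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("text1", "text2"),
                            c("apple", "better", "banana", "much")))
weights <- c(apple = 5, banana = 3, much = 0.5)

# Name-based indexing reorders the weights to the feature order
# and yields NA for features that have no weight ("better" here)
w <- weights[colnames(m)]
w[is.na(w)] <- 1
names(w) <- colnames(m)

# A bare vector is recycled in column-major order, so its elements
# follow the cells down each column rather than one weight per column
recycled <- m * w

# Repeating each weight once per row gives every column its own weight
by_feature <- m * rep(w, each = nrow(m))

recycled
by_feature
```

In recycled the weights are spread across cells in storage order, so the apple column ends up multiplied by 5 and 1 rather than by 5 throughout. In by_feature each column is scaled by its own weight, which base R can also express as sweep(m, 2, w, `*`).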

One more note: I'd encourage the use of dfm-class-specific methods for extracting things like the column names, e.g. features(mydfm) rather than colnames(mydfm), even though these will probably remain equivalent.
