Is it possible to assign weights to different features before formulating a DFM in R?
Consider this example in R:
str="apple is better than banana"
mydfm=dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)
DFM mydfm looks like:
docs apple better banana
text1 1 1 1
But I want to assign weights (apple: 5, banana: 3) beforehand, so that the DFM mydfm looks like:
docs apple better banana
text1 5 1 3
I don't think so; however, you can easily do it afterwards:
library(quanteda)
str <- "apple is better than banana"
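mydfm <- dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)
# named weights from the question
weights <- c(apple = 5, banana = 3)
# positions of the weights whose names are features of the dfm
idx <- which(names(weights) %in% colnames(mydfm))
mydfm[, names(weights)[idx]] <- mydfm[, names(weights)[idx]] %*% diag(weights[idx])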
mydfm
# 1 x 3 sparse Matrix of class "dgCMatrix"
# features
# docs apple better banana
# text1 5 1 3
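The column scaling itself does not depend on quanteda. As a rough base R sketch (a plain matrix m standing in for the dfm, and a weight vector w, both assumptions made for illustration), right-multiplying by a diagonal matrix multiplies each selected column by its weight. One caveat: diag() called with a single number builds an identity matrix of that size, so passing nrow explicitly is safer when only one feature matches.

```r
# Plain matrix standing in for a document-feature matrix (assumption).
m <- matrix(c(1, 1, 1,
              1, 1, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2"),
                            features = c("apple", "better", "banana")))
w <- c(apple = 5, banana = 3)  # illustrative weights

# Keep only the weights whose names are actual columns.
w <- w[names(w) %in% colnames(m)]

# Right-multiplying by a diagonal matrix scales each selected column.
# nrow is given explicitly because diag(5) alone would be a 5 x 5 identity.
scaled <- m
scaled[, names(w)] <- m[, names(w), drop = FALSE] %*% diag(w, nrow = length(w))
scaled
# text1 becomes 5 1 3 and text2 becomes 5 1 6
```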
This points to the need to add an option to the weight method for dfm-class, to make this easier and, more importantly, not to strip the dfm class from the sparse matrix. The dfm also has a @weights slot in the object that is designed to keep a record of how it was weighted, so this information could/should also be preserved.
@lukeA's solution drops the dfm class twice (not his or your fault but mine!!), once in the %*% and again in the <-. The first can be avoided by using vector recycling and a standard element-wise * instead of the matrix multiplication %*%, since I don't think a method has been written for dfm-class for %*% (which is why it defaults to the sparseMatrix method). The second cannot currently be avoided if you reassign sub-matrix elements, but can be avoided if you simply replace one dfm-class object with another.
To make the new dfm-class object in a way that preserves the class, this would work (and here I have made the problem slightly more complex by adding a second document and another feature):
str <- c("apple is better than banana", "banana banana apple much better")
weights <- c(apple = 5, banana = 3, much = 0.5)
mydfm <- dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)
# use name matching for indexing, sorts too, returns NA where no match is found
newweights <- weights[features(mydfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1
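# a bare mydfm * newweights would recycle the vector down each column (across documents),
# so repeat each weight once per document to give every feature column its own weight
mydfm * rep(newweights, each = nrow(mydfm))
## Document-feature matrix of: 2 documents, 4 features.
## 2 x 4 sparse Matrix of class "dfmSparse"
## features
## docs apple better banana much
## text1 5 1 3 0
## text2 5 1 6 0.5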
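One thing to watch with element-wise multiplication: R stores matrices column by column, so a shorter vector is recycled down the rows of each column rather than across the features. A minimal base R sketch (a plain matrix standing in for the dfm, which is an assumption, with the weights from the example) of matching weights by name and applying one weight per column:

```r
# Plain matrix standing in for the two-document dfm (assumption).
m <- matrix(c(1, 1, 1, 0,
              1, 1, 2, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2"),
                            features = c("apple", "better", "banana", "much")))
weights <- c(apple = 5, banana = 3, much = 0.5)

# Index by name: reorders to the column order, NA where nothing matches.
newweights <- weights[colnames(m)]
newweights[is.na(newweights)] <- 1
newweights <- unname(newweights)  # 5, 1, 3, 0.5

# Naive product: the length-4 vector is recycled down the columns,
# so text1's "better" count is multiplied by 3 instead of 1.
naive <- m * newweights

# One weight per column: repeat each weight once per row ...
by_rep <- m * rep(newweights, each = nrow(m))
# ... or let sweep() apply the vector along the column margin.
by_sweep <- sweep(m, 2, newweights, "*")
```

Both by_rep and by_sweep give 5 1 3 0 for text1 and 5 1 6 0.5 for text2. Whether sweep() keeps the dfm class is not something this sketch checks, so the rep() form is the more conservative choice with a real dfm.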
One more note: I'd encourage the use of dfm-class-specific methods for extracting things like the column names, e.g. features(mydfm) rather than colnames(mydfm), even though these will probably remain equivalent.