Create sparse matrix from tweets

Posted 2019-09-06 21:40

Question:

I have some tweets and other variables that I would like to convert into a sparse matrix.

This is basically what my data looks like. Right now it is saved in a data.table with one column that contains the tweet and one column that contains the score.

Tweet               Score
Sample Tweet :)        1
Different Tweet        0

I would like to convert this into a matrix that looks like this:

Score Sample Tweet Different :)
    1      1     1         0  1
    0      0     1         1  0

There should be one row in the sparse matrix for each row in my data.table. Is there an easy way to do this in R?

Answer 1:

This is close to what you want:

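library(Matrix)
library(data.table)  # dt is the data.table from the question

# Vocabulary: every distinct space-separated token across all tweets
words = unique(unlist(strsplit(dt[, Tweet], ' ')))

# Empty sparse matrix: one row per tweet, one column per token
M = Matrix(0, nrow = NROW(dt), ncol = length(words))
colnames(M) = words

# Flag the tweets that contain each token as a whole word
for(j in seq_along(words)){
  M[, j] = grepl(paste0('\\b', words[j], '\\b'), dt[, Tweet])
}

# Append the remaining columns (here, Score)
M = cbind(M, as.matrix(dt[, setdiff(names(dt),'Tweet'), with=FALSE]))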

#2 x 5 sparse Matrix of class "dgCMatrix"
#     Sample Tweet :) Different Score
#[1,]      1     1  .         .     1
#[2,]      .     1  .         1     .

The only small issue is that the regex is not recognising ':)' as a word. Maybe someone who knows regex better can advise how to fix this.
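One likely reason for the ':)' problem is that `\\b` only matches at a boundary between a word character and a non-word character. ':)' is made entirely of non-word characters, so no such boundary exists around it, and ')' is also a regex metacharacter. A possible workaround is to skip regex altogether and match the tokens that `strsplit` already produced. The following is a minimal sketch, not a tested general solution: it assumes tokens are separated by single spaces, the toy `dt` is rebuilt here only for illustration, and the names `tokens`, `i`, `j` and `other` are made up for this example.

```r
library(Matrix)
library(data.table)

# Toy data mirroring the example in the question (an assumption for illustration)
dt <- data.table(Tweet = c("Sample Tweet :)", "Different Tweet"),
                 Score = c(1, 0))

# Split on literal spaces, so ":)" stays a token of its own
tokens <- strsplit(dt$Tweet, " ", fixed = TRUE)
# Drop repeated tokens within a tweet so each cell is a 0/1 indicator
tokens <- lapply(tokens, unique)
words <- unique(unlist(tokens))

# Row index = tweet number, column index = position of the token in `words`
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), words)

# Build the sparse indicator matrix directly from the (i, j) pairs
M <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(nrow(dt), length(words)),
                  dimnames = list(NULL, words))

# Append the remaining columns (here only Score) as sparse columns
other <- Matrix(as.matrix(dt[, setdiff(names(dt), "Tweet"), with = FALSE]),
                sparse = TRUE)
M <- cbind(M, other)
```

Because the matrix is built from index pairs instead of a loop of `grepl` calls, this also avoids one regex pass per word. Tweets with consecutive spaces would produce empty-string tokens under this sketch, so real data may need extra cleaning first.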