I'm trying to build a co-occurrence matrix. I have a datafile of transactions and items, and I want to see a matrix of the number of transactions where items appear together.
I'm a newbie in R programming and I'm having some fun finding out all the shortcuts that R has, rather than creating specific loops (I used to use C years ago and only stick to Excel macros and SPSS now). I have checked the solutions here but haven't found one that works. The closest is the solution given in "Co-occurrence matrix using SAC?", but it produced an error message when I used projecting_tm; I suspect that the cbind wasn't successful in my case.
Essentially I have a table containing the following:
TrxID Items Quant
Trx1 A 3
Trx1 B 1
Trx1 C 1
Trx2 E 3
Trx2 B 1
Trx3 B 1
Trx3 C 4
Trx4 D 1
Trx4 E 1
Trx4 A 1
Trx5 F 5
Trx5 B 3
Trx5 C 2
Trx5 D 1, etc.
I want to create something like:
A B C D E F
A 0 1 1 0 1 1
B 1 0 3 1 1 0
C 1 3 0 1 0 0
D 1 1 1 0 1 1
E 1 1 0 1 0 0
F 0 1 1 1 0 0
What I did was (and you'd probably laugh at my rookie R approach):
library(igraph)
library(tnet)
trx <- read.table("FileName.txt", header=TRUE)
transID <- t(trx[1])
items <- t(trx[2])
id_item <- cbind(items,transID)
item_item <- projecting_tm(id_item, method="sum")
item_item <- tnet_igraph(item_item,type="weighted one-mode tnet")
item_matrix <-get.adjacency(item_item,attr="weight")
item_matrix
As mentioned above the cbind was probably unsuccessful, so the projecting_tm couldn't give me any result.
Any alternative approach or a correction to my method?
Your help would be much appreciated!
Using "dat" from either of the answers above, try crossprod and table.
I would use xtabs for this:
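A minimal sketch of both suggestions (crossprod with table, and xtabs), assuming dat is a data frame with the question's columns TrxID, Items and Quant. The sample rows are copied from the question; the object names tab, V, sp and V_sparse are only illustrative, and this is not necessarily the exact code the answers used:

```r
library(Matrix)  # needed for crossprod() and diag<- on the sparse xtabs result

# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D"),
  Quant = c(3, 1, 1, 3, 1, 1, 4, 1, 1, 1, 5, 3, 2, 1),
  stringsAsFactors = TRUE
)

# Base R: transaction-by-item incidence table, then t(tab) %*% tab.
# Assumes each item appears at most once per transaction.
tab <- table(dat$TrxID, dat$Items)
V <- crossprod(tab)
diag(V) <- 0   # the diagonal holds item frequencies, not co-occurrences
V

# xtabs variant: sparse = TRUE returns a sparse matrix from the Matrix package
sp <- xtabs(~ TrxID + Items, data = dat, sparse = TRUE)
V_sparse <- crossprod(sp)
diag(V_sparse) <- 0
V_sparse
```

On the sample data, items B and C share three transactions (Trx1, Trx3, Trx5), so both matrices should hold 3 in that cell.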
I threw in sparse = TRUE to show that this can work for very large data sets.
I'd use a combination of the reshape2 package and matrix algebra:
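One way the reshape2 plus matrix algebra idea could look, as a sketch only. It assumes the question's column names, uses acast (rather than whatever reshaping call the answer originally had) to get an item-by-transaction matrix, and the names w, x and v are illustrative:

```r
library(reshape2)

# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D"),
  Quant = c(3, 1, 1, 3, 1, 1, 4, 1, 1, 1, 5, 3, 2, 1)
)

# Item-by-transaction matrix; fun.aggregate = length counts rows per cell,
# so empty cells become 0
w <- acast(dat, Items ~ TrxID, value.var = "Quant", fun.aggregate = length)

# Binarise: 1 if the item occurs in the transaction, 0 otherwise
x <- (w > 0) * 1

# Item-by-item co-occurrence counts
v <- x %*% t(x)
diag(v) <- 0   # drop each item's co-occurrence with itself
v
```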
For the graphing maybe...
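A sketch of how a co-occurrence matrix could be turned into a weighted graph with igraph. The matrix is rebuilt here with base R so the example runs on its own; the plotting options are just one choice, not the ones the answer necessarily used:

```r
library(igraph)

# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D")
)

# Item-by-item co-occurrence matrix with a zero diagonal
m <- unclass(table(dat$TrxID, dat$Items))
v <- crossprod(m)
diag(v) <- 0

# Undirected graph whose edge weights are the co-occurrence counts
g <- graph_from_adjacency_matrix(v, mode = "undirected", weighted = TRUE, diag = FALSE)

# Thicker edges for items that appear together more often
plot(g, edge.width = E(g)$weight)
```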
This is actually very easy and clean if you create a bipartite graph first, where the top nodes are the transactions and the bottom nodes are the items. Then you create a projection to the bottom nodes.
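The bipartite approach might look roughly like this with igraph. It assumes transaction IDs and item names never collide (both become vertex names), and edges, proj and item_matrix are illustrative names:

```r
library(igraph)

# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D")
)

# One edge per distinct (transaction, item) pair
edges <- unique(dat[, c("TrxID", "Items")])
g <- graph_from_data_frame(edges, directed = FALSE)

# Mark the two vertex sets: TRUE = item, FALSE = transaction
V(g)$type <- V(g)$name %in% edges$Items

# Project onto the item vertices; the "weight" edge attribute counts
# the transactions two items share
proj <- bipartite_projection(g, which = "true")

item_matrix <- as_adjacency_matrix(proj, attr = "weight", sparse = FALSE)
item_matrix
```

Rows and columns come out in vertex order rather than alphabetically, so index by name (for example item_matrix["B", "C"]) rather than by position.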
For efficiency reasons, especially on sparse data, I would recommend using a sparse matrix.
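A sketch of the sparse route using the Matrix package, assuming the question's column names. Only the non-zero (transaction, item) pairs are stored, so memory use scales with the number of rows in the data rather than with transactions times items:

```r
library(Matrix)

# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D")
)

# Distinct pairs only, so every stored entry is exactly 1
pairs <- unique(dat[, c("TrxID", "Items")])
trx   <- factor(pairs$TrxID)
item  <- factor(pairs$Items)

# Sparse transaction-by-item incidence matrix
m <- sparseMatrix(
  i = as.integer(trx), j = as.integer(item), x = 1,
  dims = c(nlevels(trx), nlevels(item)),
  dimnames = list(levels(trx), levels(item))
)

# Item-by-item co-occurrence, still sparse
co <- crossprod(m)
diag(co) <- 0
co
```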
I tried each solution posted in this thread. None of them worked with large matrices (I was working with a 1,500 x 2,000,000 matrix).
A little bit off-topic: after calculating a co-occurrence matrix, I usually want to calculate the distance between individual items. The cosine similarity / distance can be calculated efficiently on the co-occurrence matrix like this:
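A sketch of that calculation, under the assumption that the co-occurrence matrix comes from a 0/1 incidence matrix and keeps its diagonal. In that case entry [i, i] is the number of transactions containing item i, which is also the squared length of item i's column, so the cosine is co[i, j] / sqrt(co[i, i] * co[j, j]). Names are illustrative:

```r
# Sample data copied from the question
dat <- data.frame(
  TrxID = c("Trx1", "Trx1", "Trx1", "Trx2", "Trx2", "Trx3", "Trx3",
            "Trx4", "Trx4", "Trx4", "Trx5", "Trx5", "Trx5", "Trx5"),
  Items = c("A", "B", "C", "E", "B", "B", "C", "D", "E", "A", "F", "B", "C", "D")
)

# 0/1 transaction-by-item incidence matrix (each item at most once per transaction)
m <- unclass(table(dat$TrxID, dat$Items))

# Co-occurrence matrix WITH its diagonal (item frequencies)
co <- crossprod(m)

# Column norms of m, read off the diagonal
norms <- sqrt(diag(co))

# Cosine similarity and distance between items;
# assumes every item occurs at least once, so no norm is zero
cos_sim  <- co / tcrossprod(norms)
cos_dist <- 1 - cos_sim
round(cos_sim, 3)
```

For B and C on the sample data this gives 3 / sqrt(4 * 3), since B occurs in four transactions, C in three, and they share three.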