Not sure why dcast() on this data set drops rows and columns

Posted 2019-03-01 01:32

Question:

I have a data frame that looks like:

   id fromuserid touserid from_country to_country length
1   1   54525953 47195889           US         US      2
2   2   54525953 54361607           US         US      1
3   3   54525953 53571081           US         US      2
4   4   41943048 55379244           US         US      1
5   5   47185938 53140304           US         PR      1
6   6   47185938 54121387           US         US      1
7   7   54525974 50928645           GB         GB      1
8   8   54525974 53495302           GB         GB      1
9   9   51380247 45214216           SG         SG      2
10 10   51380247 43972484           SG         US      2

Each row describes a number of messages (length) sent from one user to another user.

What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.

There are almost 200 countries. I use the function dcast as follows:

countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)

This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.

At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.

So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?

Answer 1:

Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:

chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(chats$from_country, 
                                               chats$to_country)))
chats$to_country <- factor(chats$to_country, 
                           levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
#   from_country US GB SG PR
# 1           US  5  0  0  1
# 2           GB  0  2  0  0
# 3           SG  1  0  1  0
# 4           PR  0  0  0  0
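Note that the call above counts rows per country pair, because the aggregation defaults to length. The question asks for the number of messages, which is the sum of the "length" column. Below is a minimal sketch of that, under some assumptions: it rebuilds only the three relevant columns of the example data, and it swaps in base R's xtabs() instead of dcast() so that it runs without extra packages. The names only_to, only_from, all_countries and msg_matrix are made up for illustration. It also shows one way to look for the "strange" values the question mentions, namely countries that appear on only one side.

```r
# Example data from the question (only the columns needed here).
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Countries that occur on only one side are what make the result non-square.
only_to   <- setdiff(chats$to_country, chats$from_country)   # destination only
only_from <- setdiff(chats$from_country, chats$to_country)   # origin only

# Give both columns the same set of levels, as in the answer above.
all_countries <- union(chats$from_country, chats$to_country)
chats$from_country <- factor(chats$from_country, levels = all_countries)
chats$to_country   <- factor(chats$to_country,   levels = all_countries)

# Sum the message counts per (from, to) pair; unused levels are kept,
# so the table is square.
msg_matrix <- xtabs(length ~ from_country + to_country, data = chats)
print(msg_matrix)

# With reshape2 loaded, the corresponding dcast() call should be:
# dcast(chats, from_country ~ to_country, value.var = "length",
#       fun.aggregate = sum, drop = FALSE, fill = 0)
```

On the 3M-row data set, non-empty only_to or only_from vectors would point at the countries responsible for the missing rows or columns.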

If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:

chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(levels(chats$from_country), 
                                               levels(chats$to_country))))

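Why is this necessary? If the columns are already factors, then in R versions before 4.1.0, c(chats$from_country, chats$to_country) drops the factor attributes and returns the underlying integer codes. Those integers don't match any of the character labels, so using them as levels turns every value into <NA>. From R 4.1.0 onward, c() on factors returns a factor with the union of the levels, but combining levels() explicitly works in every version.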
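As a small illustration of the point about integer codes, here is a sketch on two toy factors. The vectors from and to and the names all_levels, from_fixed and to_fixed are hypothetical, not taken from the question's data. It uses as.integer() to expose the codes directly, which avoids depending on how c() treats factors in a given R version.

```r
# Two toy factors with different level sets.
from <- factor(c("US", "GB", "SG"))
to   <- factor(c("US", "PR"))

# The values stored in a factor are integer codes, not the labels.
codes <- as.integer(from)
print(codes)

# Using those codes as levels matches none of the labels, so everything is NA.
broken <- factor(from, levels = unique(codes))
print(broken)

# Combining the labels instead gives both factors the same levels.
all_levels <- union(levels(from), levels(to))
from_fixed <- factor(as.character(from), levels = all_levels)
to_fixed   <- factor(as.character(to),   levels = all_levels)
print(levels(from_fixed))
```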