I have a data frame that looks like:
   id fromuserid touserid from_country to_country length
1   1   54525953 47195889           US         US      2
2   2   54525953 54361607           US         US      1
3   3   54525953 53571081           US         US      2
4   4   41943048 55379244           US         US      1
5   5   47185938 53140304           US         PR      1
6   6   47185938 54121387           US         US      1
7   7   54525974 50928645           GB         GB      1
8   8   54525974 53495302           GB         GB      1
9   9   51380247 45214216           SG         SG      2
10 10   51380247 43972484           SG         US      2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
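A minimal sketch of that step on the ten sample rows. It assumes `dcast` comes from the reshape2 package (the question does not say which package is in use) and that each cell should hold the total of `length`. The `value.var` and `fun.aggregate` arguments are additions made for that assumption and are not part of the original call.

```r
library(reshape2)  # assumption: dcast() is reshape2's

# The ten sample rows from the question
chats <- data.frame(
  id = 1:10,
  fromuserid = c(54525953, 54525953, 54525953, 41943048, 47185938,
                 47185938, 54525974, 54525974, 51380247, 51380247),
  touserid = c(47195889, 54361607, 53571081, 55379244, 53140304,
               54121387, 50928645, 53495302, 45214216, 43972484),
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared set of levels covering every country seen on either side
levs <- unique(c(chats$from_country, chats$to_country))
chats$from_country <- factor(chats$from_country, levels = levs)
chats$to_country <- factor(chats$to_country, levels = levs)

# drop = FALSE keeps unused level combinations, so the result is square
countries <- dcast(chats, from_country ~ to_country,
                   value.var = "length", fun.aggregate = sum,
                   drop = FALSE, fill = 0)
```

In this sample PR never appears as a sender, yet it still gets an all-zero row, which is what keeps the matrix square.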
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
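A sketch of the factor case, using two small hypothetical factors in place of `chats$from_country` and `chats$to_country`. The idea is to combine the level labels, which are character strings, instead of combining the factors themselves.

```r
# Hypothetical stand-ins for chats$from_country and chats$to_country
from_country <- factor(c("US", "US", "GB", "SG"))
to_country <- factor(c("US", "PR", "GB", "US"))

# Combine the level labels (character), not the factors themselves
levs <- unique(c(levels(from_country), levels(to_country)))

# Re-level both factors against the shared set; the values are unchanged
from_country <- factor(from_country, levels = levs)
to_country <- factor(to_country, levels = levs)
```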
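Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country) will coerce the factors to numeric, that is, to their underlying integer codes. Since those codes don't match any of the character values of the factors, using them as levels will result in <NA>. (This is the behavior of c() before R 4.1.0; from 4.1.0 on, c() applied to factors returns a factor with the combined levels.)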
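The failure can be reproduced on any R version by combining the integer codes explicitly, which is an assumption-labeled stand-in for what `c()` does with factors when it has no factor method. The two factors below are hypothetical.

```r
# Hypothetical factors with different level sets
from_country <- factor(c("US", "GB", "SG"))
to_country <- factor(c("US", "PR"))

# Stand-in for c(from_country, to_country) without a factor method:
# the integer codes are combined and the labels are lost
codes <- c(as.integer(from_country), as.integer(to_country))
levs <- unique(codes)

# No country label matches a numeric code, so every value becomes NA
bad <- factor(from_country, levels = levs)
```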