I have a file that looks like this:
type created_at repository_name
1 IssuesEvent 2012-03-11 06:48:31 bootstrap
2 IssuesEvent 2012-03-11 06:48:31 bootstrap
3 IssuesEvent 2012-03-11 06:48:31 bootstrap
4 IssuesEvent 2012-03-11 06:52:50 bootstrap
5 IssuesEvent 2012-03-11 06:52:50 bootstrap
6 IssuesEvent 2012-03-11 06:52:50 bootstrap
7 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
8 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
9 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
10 IssuesEvent 2012-03-11 07:03:58 bootstrap
11 IssuesEvent 2012-03-11 07:03:58 bootstrap
12 IssuesEvent 2012-03-11 07:03:58 bootstrap
13 WatchEvent 2012-03-11 07:15:44 bootstrap
14 WatchEvent 2012-03-11 07:15:44 bootstrap
15 WatchEvent 2012-03-11 07:15:44 bootstrap
16 WatchEvent 2012-03-11 07:18:45 hogan.js
17 WatchEvent 2012-03-11 07:18:45 hogan.js
18 WatchEvent 2012-03-11 07:18:45 hogan.js
The dataset that I'm working with can be accessed on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.
I want to create a table that has a column for each entry in the "repository_name" column (e.g. bootstrap, hogan.js). In that column I need to have the data from the "type" column that corresponds to that entry (i.e. only rows from the "type" column that also have the value "bootstrap" in the "repository_name" column should fall under the new "bootstrap" column). Hence:
- Timestamps are just for ordering and do not need to be synchronized across rows (in fact they can be deleted, as the data is already sorted by timestamp)
- Even if "IssuesEvent" is repeated 10x I need to retain all of these, since I will be doing sequence analysis using the R package TraMineR
- Columns can be of unequal length
- There is no relationship between the columns for different repos ("repository_name")
In other words, I would want a table that looks something like this:
bootstrap hogan.js
1 IssuesEvent PushEvent
2 IssuesEvent IssuesEvent
3 IssueCommentEvent WatchEvent
How can I accomplish this in R?
Some of my failed attempts using the reshape package can be found on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.
I just joined stackoverflow; hopefully my answer is somewhat useful.
By table, I assume you mean that you want a data frame. However, it seems unlikely that columns would be of equal length, and it looks like rows wouldn't have much meaning anyway. Maybe a list would be better?
Here's a messy solution:
Using @flodel's data object, you can also try aggregate(), but with many event types, this would quickly become unreadable:
You can also try reshape() and do some trickery with t() (transpose), as below.
But I find that all of those NAs are obnoxious; I would say that @flodel's answer is the most direct and probably the most useful in the long run (that is, not knowing exactly what you want to do once you get the data in this form).
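For the aggregate() idea mentioned above, here is a minimal sketch of one possible reading, not necessarily the call this answer originally used. It assumes a data frame named data built from the rows in the question, and it collapses each repository's events into a single string, which shows why the result gets hard to read as the number of events grows.

```r
# Assumed reconstruction of the sample rows from the question.
data <- data.frame(
  type = rep(c("IssuesEvent", "IssueCommentEvent", "IssuesEvent", "WatchEvent"),
             times = c(6, 3, 3, 6)),
  repository_name = rep(c("bootstrap", "hogan.js"), times = c(15, 3)),
  stringsAsFactors = FALSE
)

# One row per repository; all of its events pasted into one string.
collapsed <- aggregate(type ~ repository_name, data = data,
                       FUN = paste, collapse = " ")
print(collapsed)
```

The collapsed strings keep the order of events, but they would have to be split apart again before any sequence analysis.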
Update (more trickery)
(Actually, this is a "SO is perfect for procrastination" moment)
My final (terribly inefficient) answer is as follows.
Proceed as above, but drop the date/time stuff, and convert from factors to characters.
rbind() and cbind() will "recycle" objects of different lengths to make them the same length, but we don't want that. So, we need to force R to believe that the lengths are the same. So, find out the max length. While we're at it, extract a cleaned up version of the names in the temp3 object.
Now, extract the items from temp3 into your workspace, and make sure they are both the same length.
Finally, use cbind() to put your data together.
Your sample data:
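A minimal sketch of how the rows shown in the question could be entered as a data frame. The object name data matches the name used in the other answer; keeping the columns as character vectors via stringsAsFactors = FALSE is an assumption.

```r
# The 18 sample rows from the question, built with rep() to keep it short.
data <- data.frame(
  type = rep(c("IssuesEvent", "IssueCommentEvent", "IssuesEvent", "WatchEvent"),
             times = c(6, 3, 3, 6)),
  created_at = rep(c("2012-03-11 06:48:31", "2012-03-11 06:52:50",
                     "2012-03-11 07:03:57", "2012-03-11 07:03:58",
                     "2012-03-11 07:15:44", "2012-03-11 07:18:45"),
                   each = 3),
  repository_name = rep(c("bootstrap", "hogan.js"), times = c(15, 3)),
  stringsAsFactors = FALSE
)
str(data)
```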
I gather from your expected output that you want only one type when it shows up multiple times for the same created_at value; in other words, you want to remove duplicates:
Then, to extract all type entries per repository_name in the order they appear, you can simply use:
It returns a list, which is the R data structure of choice for a collection of vectors with different lengths.
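A minimal sketch of these two steps, assuming the sample rows are held in a data frame named data. It uses unique() to drop the repeated rows and split() to group the type values by repository; the names dedup and events are made up for illustration.

```r
# Assumed reconstruction of the sample rows from the question.
data <- data.frame(
  type = rep(c("IssuesEvent", "IssueCommentEvent", "IssuesEvent", "WatchEvent"),
             times = c(6, 3, 3, 6)),
  created_at = rep(c("2012-03-11 06:48:31", "2012-03-11 06:52:50",
                     "2012-03-11 07:03:57", "2012-03-11 07:03:58",
                     "2012-03-11 07:15:44", "2012-03-11 07:18:45"),
                   each = 3),
  repository_name = rep(c("bootstrap", "hogan.js"), times = c(15, 3)),
  stringsAsFactors = FALSE
)

# Step 1: keep a single row for each repeated (type, created_at, repo) combination.
dedup <- unique(data)

# Step 2: one character vector of event types per repository, in row order.
events <- split(dedup$type, dedup$repository_name)
str(events)
```

If every repeated event has to be kept for the sequence analysis, as the question asks, the unique() step can simply be skipped and split() applied to data directly.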
Edit: Now that you have provided an example of your output data, it has become more apparent that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function:
You can then save that to a file using
to get the exact same output format as the one you published on github.
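A minimal sketch of the padding idea, not the function from the original answer. list_to_df is a hypothetical helper name, the small events list stands in for the output of the split step, and write.csv() to a temporary file is only one way to save the result; the exact call and options needed to match the file on github are an assumption.

```r
# Hypothetical helper: pad every vector in a list with NA up to the
# length of the longest one, then bind them as columns.
list_to_df <- function(x) {
  n <- max(lengths(x))
  padded <- lapply(x, function(v) c(v, rep(NA, n - length(v))))
  data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
}

# Stand-in for the list of event types per repository.
events <- list(
  bootstrap = c("IssuesEvent", "IssuesEvent", "IssueCommentEvent"),
  hogan.js  = "WatchEvent"
)

wide <- list_to_df(events)
print(wide)

# Save to a temporary file; replace out_file with a real path as needed.
out_file <- tempfile(fileext = ".csv")
write.csv(wide, file = out_file, row.names = FALSE)
```

Padding explicitly avoids the recycling that cbind() would otherwise apply to shorter vectors, and check.names = FALSE keeps repository names that are not valid R identifiers unchanged.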