Take this simple data frame of linked ids:
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
> test
id1 id2
1 10 1
2 10 36
3 1 24
4 1 45
5 24 300
6 8 11
I now want to group together all the ids which link. By 'link', I mean follow through the chain of links so that all ids in one group are labelled together. A kind of branching structure. i.e:
Group 1
10 --> 1, 1 --> (24,45)
24 --> 300
300 --> NULL
45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300
Group 2
8 --> 11
11 --> NULL
Final group members: 8,11
Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match
or %in%
to go down each branch, but am truly stumped this time.
The final result I would be chasing is:
result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
> result
group id
1 1 10
2 1 1
3 1 24
4 1 36
5 1 45
6 1 300
7 2 8
8 2 11
The Bioconductor package RBGL (an R interface to the BOOST graph library) contains a function,
connectedComp()
, which identifies the connected components in a graph -- just what you are wanting.(To use the function, you will first need to install the graph and RBGL packages, available here and here.)
Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the
igraph
package. For those that are searching and come across this answer, mytest
dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)Without using packages: