identify groups of linked episodes which chain tog

Take this simple data frame of linked ids:

test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

> test
  id1 id2
1  10   1
2  10  36
3   1  24
4   1  45
5  24 300
6   8  11

I now want to group together all the ids which link. By 'link', I mean follow through the chain of links so that all ids in one group are labelled together. A kind of branching structure. i.e:

Group 1
10 --> 1,   1 --> (24,45)
                   24 --> 300
                          300 --> NULL
                   45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300

Group 2
8 --> 11
      11 --> NULL
Final group members: 8,11

Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.

The final result I would be chasing is:

result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))

> result
  group  id
1     1  10
2     1   1
3     1  24
4     1  36
5     1  45
6     1 300
7     2   8
8     2  11

标签： r grouping identifier linkage

3条回答

smile是对你的礼貌

2楼-- · 2019-01-03 11:39

The Bioconductor package RBGL (an R interface to the BOOST graph library) contains a function, connectedComp(), which identifies the connected components in a graph -- just what you are wanting.

(To use the function, you will first need to install the graph and RBGL packages, available here and here.)

library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

## Convert your 'from-to' data to a 'node and edge-list' representation  
## used by the 'graph' & 'RBGL' packages 
g <- ftM2graphNEL(as.matrix(test))

## Extract the connected components
cc <- connectedComp(g)

## Massage results into the format you're after 
ld <- lapply(seq_along(cc), 
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
#   group  id
# 1     1  10
# 2     1   1
# 3     1  24
# 4     1  36
# 5     1  45
# 6     1 300
# 7     2   8
# 8     2  11

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

3楼-- · 2019-01-03 11:55

Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the igraph package. For those that are searching and come across this answer, my test dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)

library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8 ),id2=c(1,36,24,45,300,11))
gr.test <- graph.data.frame(test)
links <- data.frame(id=unique(unlist(test)),group=clusters(gr.test)$membership)
links[order(links$group),]

#   id group
#1  10     1
#2   1     1
#3  24     1
#5  36     1
#6  45     1
#7 300     1
#4   8     2
#8  11     2

0人赞添加讨论(0) 举报

孤傲高冷的网名

4楼-- · 2019-01-03 11:55

Without using packages:

# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

grouppairs <- function(df){

  # from wide to long format; assumes df is 2 columns of related id's
  test <- data.frame(group = 1:nrow(df),val = unlist(df))

  # keep moving to next pair until all same values have same group
  i <- 0
  while(any(duplicated(unique(test)$val))){
    i <- i+1

    # get group of matching values
    matches <- test[test$val == test$val[i],'group']

    # change all groups with matching values to same group
    test[test$group %in% matches,'group'] <- test$group[i]
  }

  # renumber starting from 1 and show only unique values in group order
  test$group <- match(test$group, sort(unique(test$group)))
  unique(test)[order(unique(test)$group), ]
}

# test
grouppairs(test)
grouppairs(mytest)

0人赞添加讨论(0) 举报

identify groups of linked episodes which chain tog

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间