Clustering by distance in R

Posted 2019-05-16 12:00

I have a vector of integers which I wish to divide into clusters so that the distance between any two clusters is greater than a lower bound, and within any cluster, the distance between two elements is less than an upper bound.

For example, suppose we have the following vector:

1, 4, 5, 6, 9, 29, 32, 36

With the lower bound set to 19 and the upper bound set to 9, the two vectors below would be one possible result:

1, 4, 5, 6, 9

29, 32, 36


Thanks to @flodel's comments, I realized that this kind of clustering may be impossible. So I would like to modify the question a bit:

What are the possible clustering methods if I impose only the between-cluster distance lower bound? What are the possible clustering methods if I impose only the within-cluster distance upper bound?

2 answers
闹够了就滚
Reply #2 · 2019-05-16 12:26

Here's a simple algorithm that will work, explained conceptually (implementation details omitted):

  1. Ensure your list is sorted.
  2. Place a "marker" between every pair of consecutive elements that are more than lower_bound apart. These mark all the possible cluster boundaries.
  3. Include a marker before the beginning of the list and after the end.
  4. Go through consecutive pairs of markers in order. For each pair left_marker and right_marker, check whether the distance between the element immediately to the right of left_marker and the element immediately to the left of right_marker is less than upper_bound.
  5. If any check in step 4 fails, the clustering is impossible.
  6. Otherwise, the markers form the boundaries of the desired clustering.

Applying this to your example, we get:

  1. Sorted: 1, 4, 5, 6, 9, 29, 32, 36
  2. Markers: 1, 4, 5, 6, 9 | 29, 32, 36 (the only gap larger than 19 is 29 - 9 = 20)
  3. Additional start/end markers: | 1, 4, 5, 6, 9 | 29, 32, 36 |
  4. Check "upper bound" constraint: (9 - 1) = 8 < 9: TRUE; (36 - 29) = 7 < 9: TRUE
  5. None of the comparisons returned false
  6. Desired clustering: (1, 4, 5, 6, 9), (29, 32, 36)
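
The six steps above could be sketched in R roughly as follows. The function name cluster_by_bounds and its argument names are made up for illustration, and the sketch assumes a non-empty numeric vector; it returns NULL when the upper-bound check fails.

```r
# Hypothetical helper: cluster a numeric vector under both bounds.
# Returns a list of clusters, or NULL if any cluster is too wide.
cluster_by_bounds <- function(vec, lower_bound, upper_bound) {
  x <- sort(vec)                                   # step 1
  # Steps 2-3: a new cluster starts wherever the gap to the previous
  # element exceeds lower_bound; the leading TRUE acts as the marker
  # before the first element.
  starts <- c(TRUE, diff(x) > lower_bound)
  clusters <- unname(split(x, cumsum(starts)))
  # Step 4: width of each cluster (last element minus first element)
  widths <- vapply(clusters, function(g) g[length(g)] - g[1], numeric(1))
  # Steps 5-6: give up if any cluster is not narrower than upper_bound
  if (any(widths >= upper_bound)) return(NULL)
  clusters
}

result <- cluster_by_bounds(c(1, 4, 5, 6, 9, 29, 32, 36),
                            lower_bound = 19, upper_bound = 9)
print(result)
```

Splitting at every gap larger than lower_bound gives the narrowest clusters that are allowed, so if this split violates the upper bound, no coarser split can satisfy it either.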

EDIT: Original poster relaxed the conditions of the problem.

If you only want to satisfy the lower bound condition:

  1. Ensure your list is sorted.
  2. Place a marker between every pair of consecutive elements that are more than lower_bound apart.
  3. Include a marker before the beginning and after the end.
  4. These markers form the boundaries of the desired clustering.

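The following gets you step 2, assuming your vector is already sorted:

# Given (already sorted, as in step 1)
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)
lower_bound <- 19

# diff() gives the gaps between consecutive elements; which() returns
# every index i where vec[i + 1] - vec[i] exceeds lower_bound.
# (Position() would return only the first such index.)
marker_positions <- which(diff(vec) > lower_bound)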
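
To get from the marker positions to the actual clusters (steps 3 and 4), one option is to turn the positions into a cluster id per element with findInterval(). This is only a sketch on the question's data; the variable names ids and clusters are illustrative.

```r
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)   # the question's data, already sorted
lower_bound <- 19

# Index i is a marker position if the gap after element i is too large
marker_positions <- which(diff(vec) > lower_bound)

# Each cluster starts one element after a marker; findInterval() counts
# how many cluster starts lie at or before each index, which yields an
# id of 0 for the first cluster, 1 for the second, and so on.
ids <- findInterval(seq_along(vec), marker_positions + 1)
clusters <- split(vec, ids)
print(clusters)
```

If no gap exceeds lower_bound, marker_positions is empty and every element ends up in a single cluster.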
做个烂人
Reply #3 · 2019-05-16 12:52

What are the possible clustering methods if I impose only the between cluster distance lower bound?

Hierarchical clustering with single linkage:

x <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
tree <- hclust(dist(x), method = "single")
split(x, cutree(tree, h = 19))

# $`1`
# [1] 1 4 5 6 9
# 
# $`2`
# [1] 29 32 46 55
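
As a sanity check, you can compute the smallest distance between elements that ended up in different clusters and confirm that it exceeds the cut height. The names groups, d and min_between below are just for illustration.

```r
x <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
tree <- hclust(dist(x), method = "single")
groups <- cutree(tree, h = 19)

# Full pairwise distance matrix
d <- as.matrix(dist(x))
# Keep only pairs whose members belong to different clusters
min_between <- min(d[outer(groups, groups, "!=")])
print(min_between)   # 20, the gap between 9 and 29
```

With single linkage the merge height is the minimum distance between two clusters, so cutting at h leaves only clusters that are more than h apart.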

What are the possible clustering methods if I impose only the within cluster distance upper bound?

Hierarchical clustering with complete linkage:

x <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
tree <- hclust(dist(x), method = "complete")
split(x, cutree(tree, h = 9))

# $`1`
# [1] 1 4 5 6 9
# 
# $`2`
# [1] 20
# 
# $`3`
# [1] 26 29 32
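
A similar check works here: the diameter of each cluster (largest minus smallest element) should not exceed the cut height. The names clusters and diameters are illustrative.

```r
x <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
tree <- hclust(dist(x), method = "complete")
clusters <- split(x, cutree(tree, h = 9))

# Diameter of each cluster: largest element minus smallest element
diameters <- vapply(clusters, function(g) diff(range(g)), numeric(1))
print(diameters)
print(all(diameters <= 9))
```

Note that cutting at h allows a diameter equal to h; if the upper bound must be strict, cut slightly below it.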