Best programming language to implement DBSCAN algo

2020-06-29 14:31发布

问题:

I've to implement the DBSCAN algorithm. Assuming to start from this pseudocode

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood

My code has to run on an Amazon EC2 Instance with Ubuntu Linux 64 bit.

The function regionQuery queries a MongoDB database to obtain all points within P's eps-neighborhood.

So, according to you, what is the best programming language to implement it to improve performances? C, PHP, Java (I don't think)?

回答1:

I assume that you have a lot of points and need results fast - otherwise you can use almost anything.

It seems like map-reduce job for me

Map part would be loop "for each unvisited point" and should emit data construct containing neighbors, candidate clusters and whatever else. In case point is classified as noise it should emit nothing.

Cluster expansion shall go into reduce and possibly finalize part - also language choice would be javascript and everything would happen inside mongo



回答2:

Google for "parallel DBSCAN", and you will find a number of articles discussing how to parallelize this algorithm. Usually, it will change the algorithm quite a bit, for example it will require merging clusters.

Canopy pre-clustering may be a good preprocessing step for DBSCAN, too.



回答3:

I forgot to reply to my own question. I finally implemented a MapReduce version of DBSCAN algorithm. You can find it here (Hadoop).

This is the pseudo-code of how it works:

function map(P, eps, MinPts)
    if P is unvisited then
        mark P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts then
            do nothing
        else
            mark P as clusterized
            prepare the key
            create new cluster C
            C.neighborPoints = NeighborPts
            C.points = P
            emit(key, C)

function reduce(key, clusters, eps, MinPts)
    finalC is the final cluster
    for all C in clusters do
        finalC.points = finalC.points ∪ C.points
        for all P in C.neighborPoints do
            if P′ is not visited then
                mark P′ as visited
                NeighborPts′ = regionQuery(P′,eps)
                if sizeof(NeighborPts′) ≥ MinPts then
                    NeighborPts = NeighborPts ∪ NeighborPts′
                end if
            end if
            if P′ is not yet member of any cluster then
                add P′ to cluster C
            end if