I've to implement the DBSCAN algorithm.
Assuming to start from this pseudocode
DBSCAN(D, eps, MinPts)
C = 0
for each unvisited point P in dataset D
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts
mark P as NOISE
else
C = next cluster
expandCluster(P, NeighborPts, C, eps, MinPts)
expandCluster(P, NeighborPts, C, eps, MinPts)
add P to cluster C
for each point P' in NeighborPts
if P' is not visited
mark P' as visited
NeighborPts' = regionQuery(P', eps)
if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
if P' is not yet member of any cluster
add P' to cluster C
regionQuery(P, eps)
return all points within P's eps-neighborhood
My code has to run on an Amazon EC2 Instance with Ubuntu Linux 64 bit.
The function regionQuery queries a MongoDB database to obtain all points within P's eps-neighborhood.
So, according to you, what is the best programming language to implement it to improve performances? C, PHP, Java (I don't think)?
I assume that you have a lot of points and need results fast - otherwise you can use almost anything.
It seems like map-reduce job for me
Map part would be loop "for each unvisited point" and should emit data construct containing
neighbors, candidate clusters and whatever else. In case point is classified as noise it should emit nothing.
Cluster expansion shall go into reduce and possibly finalize part - also language choice would be javascript and everything would happen inside mongo
Google for "parallel DBSCAN", and you will find a number of articles discussing how to parallelize this algorithm. Usually, it will change the algorithm quite a bit, for example it will require merging clusters.
Canopy pre-clustering may be a good preprocessing step for DBSCAN, too.
I forgot to reply to my own question. I finally implemented a MapReduce version of DBSCAN algorithm.
You can find it here (Hadoop).
This is the pseudo-code of how it works:
function map(P, eps, MinPts)
if P is unvisited then
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts then
do nothing
else
mark P as clusterized
prepare the key
create new cluster C
C.neighborPoints = NeighborPts
C.points = P
emit(key, C)
function reduce(key, clusters, eps, MinPts)
finalC is the final cluster
for all C in clusters do
finalC.points = finalC.points ∪ C.points
for all P in C.neighborPoints do
if P′ is not visited then
mark P′ as visited
NeighborPts′ = regionQuery(P′,eps)
if sizeof(NeighborPts′) ≥ MinPts then
NeighborPts = NeighborPts ∪ NeighborPts′
end if
end if
if P′ is not yet member of any cluster then
add P′ to cluster C
end if