I need to implement the DBSCAN algorithm, starting from this pseudocode:
DBSCAN(D, eps, MinPts)
    C = 0
    for each unvisited point P in dataset D
        mark P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts
            mark P as NOISE
        else
            C = next cluster
            expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
    add P to cluster C
    for each point P' in NeighborPts
        if P' is not visited
            mark P' as visited
            NeighborPts' = regionQuery(P', eps)
            if sizeof(NeighborPts') >= MinPts
                NeighborPts = NeighborPts joined with NeighborPts'
        if P' is not yet member of any cluster
            add P' to cluster C

regionQuery(P, eps)
    return all points within P's eps-neighborhood
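
For reference, the pseudocode above could be written roughly as follows in Python. This is only a minimal in-memory sketch under some assumptions that are not part of the question: points are tuples held in a list, distance is Euclidean, and the names `dbscan`, `region_query` and `NOISE` are made up for illustration. Here `region_query` scans the whole list instead of querying a database.

```python
from math import dist  # Python 3.8+

NOISE = -1  # hypothetical label for noise points


def region_query(points, p, eps):
    """Return the indices of all points within eps of points[p], p included."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)      # None means "not assigned yet"
    visited = [False] * len(points)
    cluster = -1
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
            continue
        cluster += 1
        labels[p] = cluster
        # expandCluster: the neighbor list grows while it is being scanned
        seen = set(neighbors)
        i = 0
        while i < len(neighbors):
            q = neighbors[i]
            i += 1
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    for r in q_neighbors:
                        if r not in seen:
                            seen.add(r)
                            neighbors.append(r)
            # a point previously marked as noise can still become a border point
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster
    return labels
```

With this linear-scan `region_query` the whole run is quadratic in the number of points, which is why the cost of the neighborhood query tends to matter more than the language the loop is written in.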
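My code must run on an Amazon EC2 instance running 64-bit Ubuntu Linux.
The regionQuery function queries a MongoDB database to get all points within P's eps-neighborhood.
So, in your opinion, which programming language would be best to implement it in order to get good performance? C, PHP, Java (I don't think so)?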
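
To make the regionQuery part concrete, here is one way it might look against MongoDB, written in Python. It is a sketch under several assumptions that the question does not state: the points are 2D, they are stored as `[x, y]` pairs in a field hypothetically called `loc`, planar distance is what is wanted, and `collection` is a pymongo-style collection object with a `find(filter)` method. The helper names are made up. `$geoWithin` with `$center` is MongoDB's operator for a circle on flat coordinates.

```python
def eps_neighborhood_filter(field, center, eps):
    """Build a MongoDB filter for documents whose `field` lies within eps of center."""
    x, y = center
    return {field: {"$geoWithin": {"$center": [[x, y], eps]}}}


def region_query(collection, center, eps, field="loc"):
    """Fetch all documents in the eps-neighborhood of center.

    `collection` is assumed to behave like a pymongo Collection;
    `field` is a hypothetical name for where the coordinates are stored.
    """
    return list(collection.find(eps_neighborhood_filter(field, center, eps)))
```

A `2d` index on the coordinate field would normally be what keeps this query fast; without it the server has to scan the collection for every call.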
I assume you have a lot of points and need the results fast; otherwise you could use almost anything.
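This looks like a map-reduce job to me.
The map part would be the "for each unvisited point" loop, and it should emit a data structure containing the neighbors, the candidate cluster and anything else that is needed. If the point is classified as noise, it should emit nothing.
The cluster expansion would go into the reduce, and possibly the finalize, part. In that case the language of choice is JavaScript, and everything happens inside Mongo.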
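Google for "parallel DBSCAN" and you will find a number of articles discussing how to parallelize the algorithm. Usually this changes the algorithm quite a bit; for example, it requires merging clusters.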
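
As a rough illustration of the merging step, partial clusters found independently can be joined with a union-find structure whenever they share a point. This Python sketch assumes each partial cluster is simply a set of point ids, and the function name is made up. It also simplifies things: a real parallel DBSCAN would normally merge only when the shared point is a core point, since a border point may legitimately sit next to two separate clusters.

```python
def merge_clusters(partial_clusters):
    """Merge partial clusters (sets of point ids) that share at least one point."""
    parent = list(range(len(partial_clusters)))

    def find(i):
        # follow parent links to the representative, halving the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # point id -> index of the first partial cluster containing it
    for idx, cluster in enumerate(partial_clusters):
        for point in cluster:
            if point in owner:
                parent[find(idx)] = find(owner[point])
            else:
                owner[point] = idx

    merged = {}
    for idx, cluster in enumerate(partial_clusters):
        merged.setdefault(find(idx), set()).update(cluster)
    return list(merged.values())
```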
Canopy pre-clustering may be a good preprocessing step for DBSCAN, too.
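
For readers who have not met canopy clustering, here is a small Python sketch of the usual two-threshold scheme: a loose distance `t1` decides canopy membership and a tight distance `t2` decides which points stop being candidates for new canopy centers. The function name, the in-memory list of tuples and the Euclidean distance are assumptions for illustration. How the canopies are then used with DBSCAN (for example, restricting neighborhood queries to points that share a canopy) is left open here.

```python
from math import dist  # Python 3.8+


def canopies(points, t1, t2):
    """Group points into possibly overlapping canopies, with loose t1 > tight t2."""
    if not t1 > t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # the next remaining point starts a new canopy
        canopy = [center] + [p for p in remaining if dist(center, p) < t1]
        result.append(canopy)
        # points closer than t2 to the center can no longer start a canopy
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return result
```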
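I forgot to answer my own question. In the end I implemented a MapReduce version of the DBSCAN algorithm. You can find it here (Hadoop).
This is pseudocode for how it works: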
function map(P, eps, MinPts)
    if P is unvisited then
        mark P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts then
            do nothing
        else
            mark P as clusterized
            prepare the key
            create new cluster C
            C.neighborPoints = NeighborPts
            C.points = P
            emit(key, C)
        end if
    end if
end function

function reduce(key, clusters, eps, MinPts)
    finalC is the final cluster
    for all C in clusters do
        finalC.points = finalC.points ∪ C.points
        for all P′ in C.neighborPoints do
            if P′ is not visited then
                mark P′ as visited
                NeighborPts′ = regionQuery(P′, eps)
                if sizeof(NeighborPts′) ≥ MinPts then
                    NeighborPts = NeighborPts ∪ NeighborPts′
                end if
            end if
            if P′ is not yet member of any cluster then
                add P′ to cluster C
            end if
        end for
    end for
end function
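
To see the two phases side by side, here is a small single-process Python sketch of the same idea. It is not the Hadoop implementation mentioned above, and it makes several assumptions: points live in an in-memory list, `region_query` is a linear scan, the "visited" state is local to the reducer, and how the key is prepared is left to the caller, since the pseudocode does not say. All names are hypothetical.

```python
from dataclasses import dataclass, field
from math import dist  # Python 3.8+


@dataclass
class PartialCluster:
    points: set = field(default_factory=set)
    neighbor_points: set = field(default_factory=set)


def region_query(points, p, eps):
    """Return the indices of all points within eps of points[p], p included."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]


def map_point(points, p, eps, min_pts):
    """Return a partial cluster if p is a core point, else None (emit nothing)."""
    neighbors = region_query(points, p, eps)
    if len(neighbors) < min_pts:
        return None
    return PartialCluster(points={p}, neighbor_points=set(neighbors))


def reduce_clusters(points, clusters, eps, min_pts):
    """Merge the partial clusters that arrived under one key and expand them."""
    final = set()
    frontier = []
    for c in clusters:
        final |= c.points
        frontier.extend(c.neighbor_points)
    visited = set(final)  # local to this reducer, a simplification
    while frontier:
        q = frontier.pop()
        if q not in visited:
            visited.add(q)
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:
                frontier.extend(q_neighbors)
        final.add(q)
    return final
```

A driver would call `map_point` for every point, group the non-`None` results by whatever key it chooses, and call `reduce_clusters` once per key.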
Source: Best programming language to implement DBSCAN algorithm querying a MongoDB database?