Is there a distance calculation implementation using Hadoop map/reduce? I am trying to calculate the distances between a given set of points.
Looking for any resources.
Edit
This is a very intelligent solution. I have tried something like the first algorithm, and I get almost what I was looking for. I am not concerned about optimizing the program at the moment, but my problem was that the dist(X,Y) function was not working. When I got all the points in the reducer, I was unable to go through all the points with an Iterator and calculate the distances. Someone on stackoverflow.com told me that the Iterator in Hadoop is different from the normal Java Iterator; I am not sure about that. But if I can find a simple way to go through the Iterator in my dist() function, I can use your second algorithm to optimize.
// This is your code and I am referring to it too, just to make my point clear.
map(x, y):
    for i in 1:N                       # N = number of points
        emit(i, (x, y))                # I did exactly this

reduce(i, X):
    p1 = X[i]
    for j in i:N
        # here is my problem: I can't get the values from the Iterator
        # (see the sketch below)
        emit(dist(X[i], X[j]))
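On the Iterator problem: the values Hadoop hands to reduce() can be traversed only once, and the framework reuses the same Writable object for every value, so each point must be copied out before the points can be compared pairwise. A minimal sketch of a reducer that buffers the points and then loops over all pairs (new org.apache.hadoop.mapreduce API; the "x,y" Text encoding and all names here are assumptions, and with the first algorithm above you would pick out X[i] by the key rather than loop over every pair):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DistReducer extends Reducer<IntWritable, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Copy every value: Hadoop reuses the same Text instance on each step,
            // so references into the Iterable would all point at the last value.
            List<double[]> points = new ArrayList<>();
            for (Text value : values) {
                String[] parts = value.toString().split(",");
                points.add(new double[] { Double.parseDouble(parts[0]),
                                          Double.parseDouble(parts[1]) });
            }
            // The buffered points are plain Java objects and can be scanned repeatedly.
            for (int i = 0; i < points.size(); i++) {
                for (int j = i + 1; j < points.size(); j++) {
                    context.write(new Text(i + "," + j),
                                  new DoubleWritable(dist(points.get(i), points.get(j))));
                }
            }
        }

        private static double dist(double[] p1, double[] p2) {
            double dx = p1[0] - p2[0];
            double dy = p1[1] - p2[1];
            return Math.sqrt(dx * dx + dy * dy);
        }
    }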
You need to do a self-join on that data set. In Hive that would look, more or less, like the following.
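A sketch of such a query, assuming a table points with columns x and y:

    SELECT dist(a.x, a.y, b.x, b.y)
    FROM points a JOIN points b ON (true)
    WHERE a.x < b.x OR (a.x = b.x AND a.y < b.y);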
The function dist would need to be implemented using other Hive functions, or written in Java and added as a UDF. Also, I am not sure about the true constant, but you can write 0=0 to the same effect. The WHERE clause is there to avoid computing the same distance twice and to avoid 0 distances. The question is: would Hive optimize this the way you can by programming carefully in Hadoop? I am not sure. Here is a sketch in Hadoop:
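    map(x, y):
        for i in 1:N                   # N = number of points
            emit(i, (x, y))

    reduce(i, X):
        p1 = X[i]
        for j in i:N
            emit(dist(X[i], X[j]))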
For this to work, you need X to get to the reducer sorted in some order, for instance by x and then by y, using secondary sort keys (which do not affect the grouping). This way every reducer gets a copy of all the points and works on one column of the distance matrix you are trying to generate. The memory requirements are minimal. You could trade some communication for memory by reorganizing the computation so that every reducer computes a square submatrix of the final matrix, knowing only two subsets of the points and calculating the distances among all of them. To achieve this, you need to make the order of your points explicit, say by storing i, x, y.
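A sketch of that blocked variant (the row/col tags, the integer division by the block size k, and the p1.i < p2.i guard are one possible way to realize this, not the only one):

    map(i, (x, y)):
        b = i / k                          # block of point i; blocks have size k
        for r in 0:(N/k - 1):
            emit((b, r), ("row", i, x, y)) # i acts as a row in block pair (b, r)
            emit((r, b), ("col", i, x, y)) # i acts as a column in block pair (r, b)

    reduce((r, s), X):                     # X = the points of blocks r and s
        rows = the k points tagged "row"   # the secondary sort delivers these first
        for p2 in the points tagged "col": # stream the column points
            for p1 in rows:
                if p1.i < p2.i:            # each pair once, and no 0 distances
                    emit(dist(p1, p2))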
In this case you can see that the map phase emits only 2*N^2/k points, whereas the previous algorithm emitted N^2. Here we have (N/k)^2 reducers, vs. N for the other one. Each reducer has to hold k values in memory (using the secondary-key technique to have all the rows get to the reducer before all the columns), vs. only 2 before. So you see there are tradeoffs, and for the second algorithm you can use the parameter k for performance tuning.
This problem does not sound like a good fit for map-reduce, since you're not really able to break it into pieces and calculate each piece independently. If you could have a separate program that generates the complete graph of your points as a list (x1, y1, x2, y2), then you could do a straightforward map to get the distances.
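For example, if every input line held one pre-joined pair in the form x1,y1,x2,y2, the map would only parse and compute (a sketch; the comma-separated layout and Euclidean distance are assumptions):

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PairDistanceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is one edge of the complete graph: x1,y1,x2,y2
            String[] f = line.toString().split(",");
            double dx = Double.parseDouble(f[0]) - Double.parseDouble(f[2]);
            double dy = Double.parseDouble(f[1]) - Double.parseDouble(f[3]);
            context.write(line, new DoubleWritable(Math.sqrt(dx * dx + dy * dy)));
        }
    }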