how to implement eigenvalue calculation with MapRe

2019-03-16 00:27发布

It is possible because PageRank was a form of eigenvalue and that is why MapReduce introduced. But there seems problems in actual implementation, such as every slave computer have to maintain a copy of the matrix?

5条回答
贪生不怕死
2楼-- · 2019-03-16 01:14

PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.

If NxM matrix A describes the link weight (amount of flow) from node n to node m, then

p_{n+1} = A . p_{n} 

In the limit where p has converged to a steady state (p_n+1 = p_n), this is an eigenvector problem with eigenvalue 1.

The PageRank algorithm doesn't require the matrix to be held in memory, but is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- you need locality and broad exchange among nodes -- and you should instead look at LaPACK and MPI and friends.

You can see a working pagerank implementation in the wukong library (hadoop streaming for ruby) or in the Heretrix pagerank submodule. (The heretrix code runs independently of Heretrix)

(disclaimer: I am an author of wukong.)

查看更多
beautiful°
3楼-- · 2019-03-16 01:15

I suspect it is intractable for most matrices except those w/ special structures (e.g. sparse matrices or ones w/ certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.

PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)

查看更多
一夜七次
4楼-- · 2019-03-16 01:15

The apache hama project has some interesting implementation of the Jacobi eigenvalue algorithm. It runs on hadoop. Note the rotation happens in the scan of the matrix not in the map reduce.

查看更多
【Aperson】
5楼-- · 2019-03-16 01:19

PREAMBLE:

Given the right sequestration of data, one could achieve parallel computing results without a complete dataset on every machine.

Take for example the following loop:

for (int i = 0; i < m[].length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        m[i][j]++; 
    }
}

And given a matrix of the following layout:

       j=0   j=1   j=2
 i=0  [   ] [   ] [   ]
 i=1  [   ] [   ] [   ]
 i=2  [   ] [   ] [   ]

Parallel constructs exist such that the J column can be sent to each computer and the single columns are computed in parallel. The difficult part of parallelization comes when you've got loops that contain dependencies.

for (int i = 0; i < m[].length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        //For obvious reasons, matrix index verification code removed
        m[i][j] = m[i/2][j] + m[i][j+7]; 
    }
}

Obviously a loop like the one above becomes extremely problematic (notice the matrix indexers.) But techniques do exist for unrolling these types of loops and creating effective parallel algorithms.

ANSWER:

It is possible that google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave computers. -Or- They used something like Monte Carlo or some other Approximation Algorithm to develop a "close enough" calculation.

In fact, I'd go so far as to say that Google will have gone to as great of lengths as possible to make any calculation required for their PageRank algorithm as efficient as possible. When you're running machines such as these and this (notice the Ethernet cable) you can't be transferring large datasets(100s of gigs) because it is impossible given their hardware limitations of commodity NIC cards.

With that said, Google is good at surprising the programmer community and their implementation could be entirely different.

POSTAMBLE:

Some good resources for parallel computing would include OpenMP and MPI. Both parallel implementations approach parallel computing from very different paradigms, some of which stems from machine implementation (cluster vs. distributed computing.)

查看更多
Anthone
6楼-- · 2019-03-16 01:27

I can answer myself now. The PageRank algorithm take advantage of sparse matrix where it should converge at the eigenvalue with several self-multiply. Thus, in PageRank practice, the Map/Reduce procedure is valid. You can perform matrix multiply in Map procedure and form a sparse matrix in Reduce procedure. But for general matrix eigenvalue finding, it is still a tricky problem.

查看更多
登录 后发表回答