It is possible because PageRank was a form of eigenvalue and that is why MapReduce introduced. But there seems problems in actual implementation, such as every slave computer have to maintain a copy of the matrix?
相关问题
- Finding k smallest elements in a min heap - worst-
- Spark on Yarn Container Failure
- binary search tree path list
- High cost encryption but less cost decryption
- d3.js moving average with previous and next data v
相关文章
- Java写文件至HDFS失败
- What are the problems associated to Best First Sea
- ceil conterpart for Math.floorDiv in Java?
- mapreduce count example
- Coin change DP solution to keep track of coins
- why 48 bit seed in util Random class?
- Algorithm for partially filling a polygonal mesh
- Robust polygon normal calculation
PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.
If NxM matrix A describes the link weight (amount of flow) from node n to node m, then
In the limit where p has converged to a steady state (p_n+1 = p_n), this is an eigenvector problem with eigenvalue 1.
The PageRank algorithm doesn't require the matrix to be held in memory, but is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- you need locality and broad exchange among nodes -- and you should instead look at LaPACK and MPI and friends.
You can see a working pagerank implementation in the wukong library (hadoop streaming for ruby) or in the Heretrix pagerank submodule. (The heretrix code runs independently of Heretrix)
(disclaimer: I am an author of wukong.)
I suspect it is intractable for most matrices except those w/ special structures (e.g. sparse matrices or ones w/ certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.
PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)
The apache hama project has some interesting implementation of the Jacobi eigenvalue algorithm. It runs on hadoop. Note the rotation happens in the scan of the matrix not in the map reduce.
PREAMBLE:
Given the right sequestration of data, one could achieve parallel computing results without a complete dataset on every machine.
Take for example the following loop:
And given a matrix of the following layout:
Parallel constructs exist such that the J column can be sent to each computer and the single columns are computed in parallel. The difficult part of parallelization comes when you've got loops that contain dependencies.
Obviously a loop like the one above becomes extremely problematic (notice the matrix indexers.) But techniques do exist for unrolling these types of loops and creating effective parallel algorithms.
ANSWER:
It is possible that google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave computers. -Or- They used something like Monte Carlo or some other Approximation Algorithm to develop a "close enough" calculation.
In fact, I'd go so far as to say that Google will have gone to as great of lengths as possible to make any calculation required for their PageRank algorithm as efficient as possible. When you're running machines such as these and this (notice the Ethernet cable) you can't be transferring large datasets(100s of gigs) because it is impossible given their hardware limitations of commodity NIC cards.
With that said, Google is good at surprising the programmer community and their implementation could be entirely different.
POSTAMBLE:
Some good resources for parallel computing would include OpenMP and MPI. Both parallel implementations approach parallel computing from very different paradigms, some of which stems from machine implementation (cluster vs. distributed computing.)
I can answer myself now. The PageRank algorithm take advantage of sparse matrix where it should converge at the eigenvalue with several self-multiply. Thus, in PageRank practice, the Map/Reduce procedure is valid. You can perform matrix multiply in Map procedure and form a sparse matrix in Reduce procedure. But for general matrix eigenvalue finding, it is still a tricky problem.