I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?
相关问题
- Name for a method that has only side effects
- Unable to generate jar file for Hadoop
- how to calculate count and unique count over two f
- Spark - Group by Key then Count by Value
- Best way to keep the user-interface up-to-date?
相关文章
- mapreduce count example
- Should client-server code be written in one “proje
- Algorithm for maximizing coverage of rectangular a
- Is there an existing solution for these particular
- What is Scope Creep? [closed]
- Exception in thread “main” java.lang.NoClassDefFou
- Compute first order derivative with MongoDB aggreg
- How can I modify .xfdl files? (Update #1)
After getting most frustrated with either very long waffley or very short vague blog posts I eventually discovered this very good rigorous concise paper.
Then I went ahead and made it more concise by translating into Scala, where I've provided the simplest case where a user simply just specifies the
map
andreduce
parts of the application. In Hadoop/Spark, strictly speaking, a more complex model of programming is employed that require the user to explicitly specify 4 more functions outlined here: http://en.wikipedia.org/wiki/MapReduce#DataflowMap is a function that applies another function to all the items on a list, to produce another list with all the return values on it. (Another way of saying "apply f to x" is "call f, passing it x". So sometimes it sounds nicer to say "apply" instead of "call".)
This is how map is probably written in C# (it's called
Select
and is in the standard library):As you're a Java dude, and Joel Spolsky likes to tell GROSSLY UNFAIR LIES about how crappy Java is (actually, he's not lying, it is crappy, but I'm trying to win you over), here's my very rough attempt at a Java version (I have no Java compiler, and I vaguely remember Java version 1.1!):
I'm sure this can be improved in a million ways. But it's the basic idea.
Reduce is a function that turns all the items on a list into a single value. To do this, it needs to be given another function
func
that turns two items into a single value. It would work by giving the first two items tofunc
. Then the result of that along with the third item. Then the result of that with the fourth item, and so on until all the items have gone and we're left with one value.In C# reduce is called
Aggregate
and is again in the standard library. I'll skip straight to a Java version:These Java versions need generics adding to them, but I don't know how to do that in Java. But you should be able to pass them anonymous inner classes to provide the functors:
Hopefully generics would get rid of the casts. The typesafe equivalent in C# is:
Why is this "cool"? Simple ways of breaking up larger calculations into smaller pieces, so they can be put back together in different ways, are always cool. The way Google applies this idea is to parallelization, because both map and reduce can be shared out over several computers.
But the key requirement is NOT that your language can treat functions as values. Any OO language can do that. The actual requirement for parallelization is that the little
func
functions you pass to map and reduce must not use or update any state. They must return a value that is dependent only on the argument(s) passed to them. Otherwise, the results will be completely screwed up when you try to run the whole thing in parallel.Map is a native JS method that can be applied to an array. It creates a new array as a result of some function mapped to every element in the original array. So if you mapped a function(element) { return element * 2;}, it would return a new array with every element doubled. The original array would go unmodified.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map
Reduce is a native JS method that can also be applied to an array. It applies a function to an array and has an initial output value called an accumulator. It loops through each element in the array, applies a function, and reduces them to a single value (which begins as the accumulator). It is useful because you can have any output you want, you just have to start with that type of accumulator. So if I wanted to reduce something into an object, I would start with an accumulator {}.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/Reduce?v=a
From the abstract of Google's MapReduce research publication page:
The advantage of MapReduce is that the processing can be performed in parallel on multiple processing nodes (multiple servers) so it is a system that can scale very well.
Since it's based from the functional programming model, the
map
andreduce
steps each do not have any side-effects (the state and results from each subsection of amap
process does not depend on another), so the data set being mapped and reduced can each be separated over multiple processing nodes.Joel's Can Your Programming Language Do This? piece discusses how understanding functional programming was essential in Google to come up with MapReduce, which powers its search engine. It's a very good read if you're unfamiliar with functional programming and how it allows scalable code.
See also: Wikipedia: MapReduce
Related question: Please explain mapreduce simply
MapReduce and/or SQL:
http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
http://www.data-miners.com/blog/
criticism of MapReduce
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
http://www.databasecolumn.com/2008/01/mapreduce-continued.html
MapReduce Explained.
It explains better than what I can. Does it help?