Why isn't Hadoop implemented using MPI?

Posted 2019-03-08 05:16

Question:

Correct me if I'm wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes.

What are the technical reasons for this?

I could hazard a few guesses, but I do not know enough of how MPI is implemented "under the hood" to know whether or not I'm right.

Come to think of it, I'm not entirely familiar with Hadoop's internals either. I understand the framework at a conceptual level (map/combine/shuffle/reduce and how that works at a high level), but I don't know the nitty-gritty implementation details. I've always assumed Hadoop transmits serialized data structures (perhaps GPBs) over a TCP connection, e.g. during the shuffle phase. Let me know if that's not true.

Answer 1:

One of the big features of Hadoop/map-reduce is the fault tolerance. Fault tolerance is not supported in most (any?) current MPI implementations. It is being thought about for future versions of OpenMPI.

Sandia labs has a version of map-reduce which uses MPI, but it lacks fault tolerance.
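To make the fault-tolerance point concrete, here is a minimal sketch in Python of the re-execution idea: because a map task is a deterministic function of its input split, a scheduler can simply run it again elsewhere when a worker dies. Every name here (`run_with_retries`, `make_worker`, `WorkerFailure`, the failure pattern) is hypothetical and chosen for illustration; this is not Hadoop's actual scheduler code.

```python
class WorkerFailure(Exception):
    """Stands in for a crashed node or a lost task attempt."""


def run_with_retries(task, split, workers, max_attempts=4):
    """Run task(split) on successive workers until one attempt succeeds.

    Returns (result, number_of_attempts_used).
    """
    errors = []
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]  # move on to the next worker
        try:
            return worker(task, split), attempt + 1
        except WorkerFailure as exc:
            errors.append(str(exc))
    raise RuntimeError(f"task failed after {max_attempts} attempts: {errors}")


def make_worker(name, fail_first_n):
    """Build a hypothetical worker that fails its first fail_first_n calls."""
    state = {"calls": 0}

    def worker(task, split):
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise WorkerFailure(f"{name} died")
        return task(split)

    return worker


def count_words(split):
    """A toy map task: count the words in one input split."""
    return len(split.split())


if __name__ == "__main__":
    workers = [make_worker("node-a", 1), make_worker("node-b", 0)]
    result, attempts = run_with_retries(count_words, "to be or not to be", workers)
    print(result, attempts)
```

The retry only works because the task has no hidden state shared with other tasks. In a tightly coupled MPI program the ranks usually depend on each other's messages, and the default error handler, `MPI_ERRORS_ARE_FATAL`, aborts the job when an error is detected, so recovery has to be designed in by the application.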



Answer 2:

MPI is Message Passing Interface. Right there in the name - there is no data locality. You send the data to another node for it to be computed on. Thus MPI is network-bound in terms of performance when working with large data.

MapReduce with the Hadoop Distributed File System duplicates data so that you can do your compute in local storage - streaming off the disk and straight to the processor. Thus MapReduce takes advantage of local storage to avoid the network bottleneck when working with large data.

This is not to say that MapReduce doesn't use the network... it does: and the shuffle is often the slowest part of a job! But it uses it as little, and as efficiently as possible.
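The locality idea can be sketched as a small greedy scheduler in Python: given which nodes hold a replica of each block, prefer a node that already stores the block and fall back to a remote node only when no replica holder has a free slot. The function name `assign_tasks`, the data layout, and the greedy policy are assumptions made for illustration, not Hadoop's real scheduling algorithm.

```python
def assign_tasks(block_replicas, free_slots):
    """Assign each block to a node, preferring nodes that hold a replica.

    block_replicas: dict of block id -> list of node names storing that block
    free_slots:     dict of node name -> number of tasks the node can accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, replicas in sorted(block_replicas.items()):
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True  # data is read from local disk
        else:
            remote = [node for node in sorted(slots) if slots[node] > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False  # block must cross the network
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    replicas = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n1"]}
    plan = assign_tasks(replicas, {"n1": 1, "n2": 1, "n3": 1})
    for block, (node, is_local) in plan.items():
        print(block, node, "local" if is_local else "remote")
```

In this run `b3` ends up remote even though a fully local assignment exists (`b1` on `n2`, `b3` on `n1`), which shows the limit of a purely greedy policy. Keeping several replicas of each block is what gives a scheduler enough choices to keep most tasks local.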

To sum it up: Hadoop (and Google's stuff before it) did not use MPI because it could not have used MPI and worked. MapReduce systems were developed specifically to address MPI's shortcomings in light of trends in hardware: disk capacity exploding (and data with it), disk speed stagnant, networks slow, processor gigahertz peaked, multi-core taking over Moore's law.



Answer 3:

The truth is that Hadoop could be implemented using MPI. MapReduce has been used via MPI for as long as MPI has been around. MPI has functions like 'bcast' (broadcast all data), 'alltoall' (send all data to all nodes), 'reduce' and 'allreduce'. Hadoop removes the requirement to explicitly implement your data distribution and result-gathering methods by packaging an outgoing communication command with a reduce command. The downside is that you need to make sure your problem fits the 'reduce' function before you implement it in Hadoop. It could be that your problem is a better fit for 'scatter'/'gather', in which case you should use Torque/MAUI/SGE with MPI instead of Hadoop. Finally, MPI does not write your data to disk as described in another post, unless you follow your receive method with a write to disk. It works just as Hadoop does, by sending your process/data somewhere else to do the work. The important part is to understand your problem in enough detail to be sure MapReduce is the most efficient parallelization strategy, and to be aware that many other strategies exist.
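As a rough illustration of how the shuffle maps onto a collective, here is a single-process Python simulation in which each list element plays the role of one rank. The helper `alltoall` only mimics the semantics of the MPI collective (rank j receives the j-th item from every rank); `partition` and `mapreduce_wordcount` are hypothetical names, and no real MPI library is involved.

```python
import zlib
from collections import Counter


def partition(key, size):
    """Map a key to its owning rank deterministically.

    crc32 is used because Python's built-in hash() is salted per process.
    """
    return zlib.crc32(key.encode("utf-8")) % size


def alltoall(send):
    """Simulate alltoall: rank j receives send[i][j] from every rank i."""
    size = len(send)
    return [[send[i][j] for i in range(size)] for j in range(size)]


def mapreduce_wordcount(splits):
    """Word count over len(splits) simulated ranks, one input split per rank."""
    size = len(splits)
    # Map: each rank counts its own split and buckets the counts by owner rank.
    send = []
    for text in splits:
        buckets = [Counter() for _ in range(size)]
        for word in text.split():
            buckets[partition(word, size)][word] += 1
        send.append(buckets)
    # Shuffle: one alltoall delivers every bucket to the rank that owns it.
    recv = alltoall(send)
    # Reduce: each rank sums the partial counts it received.
    return [sum(parts, Counter()) for parts in recv]


if __name__ == "__main__":
    for rank, counts in enumerate(mapreduce_wordcount(["a b a", "b c", "a c c"])):
        print(rank, dict(counts))
```

With a real binding such as mpi4py, the simulated helper would presumably be replaced by the communicator's own `alltoall` call, with each process holding only its own row of `send`. The sketch also shows what the answer means by explicit data distribution: the bucketing and the exchange are written by hand here, whereas Hadoop performs them for you.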



Answer 4:

Not an exact answer to your question, but this might help: What are some scenarios for which MPI is a better fit than MapReduce?



Answer 5:

In MapReduce 2.0 (MRv2), also known as YARN, applications can be written for (or ported to run on) YARN.

Thus, essentially, there will be a Next Generation Apache Hadoop MapReduce (MAPREDUCE-279) and a way to support multiple programming paradigms on top of it. So one can write MPI applications on YARN. The MapReduce programming paradigm will always be supported by default.

http://wiki.apache.org/hadoop/PoweredByYarn should give an idea of which applications have been developed on top of YARN, including Open MPI.



Answer 6:

There is no restriction that prevents MPI programs from using local disks. And of course MPI programs always attempt to work locally on data, in RAM or on local disk, just like all parallel applications. In MPI 2.0 (which is not a future version; it has been here for a decade) it is possible to add and remove processes dynamically, which makes it possible to implement applications that can recover from, e.g., a process dying on some node.

Perhaps Hadoop is not using MPI because MPI usually requires coding in C or Fortran and has a more scientific/academic developer culture, while Hadoop seems to be more driven by IT professionals with a strong Java bias. MPI is very low-level and error-prone. It allows very efficient use of hardware, RAM and network. Hadoop tries to be high-level and robust, with an efficiency penalty. MPI programming requires discipline and great care to be portable, and still requires compilation from source code on each platform. Hadoop is highly portable, easy to install, and allows pretty quick-and-dirty application development. It's a different scope.

Still, perhaps the Hadoop hype will be followed by more resource-efficient alternatives, perhaps based on MPI.



Answer 7:

If we just look at the Map / Reduce steps and the scheduling part of Hadoop, then I would argue MPI is a much better methodology / technology. MPI supports many different exchange patterns like broadcast, barrier, gather all, scatter / gather (or call it map-reduce). But Hadoop also has HDFS. With this, the data can sit much closer to the processing nodes. And if you look at the problem space Hadoop-like technologies were used for, the outputs of the reduction steps were actually fairly large, and you wouldn't want to have all that information swamp your network. That's why Hadoop saves everything to disk. But the control messages could have used MPI, and the MPI messages could just have pointers (URLs or file handles) to the actual data on disk ...
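The last suggestion, control messages that carry only pointers to data on disk, can be sketched in Python as follows. A `queue.Queue` stands in for the messaging layer (MPI or anything else), and the message fields `task`, `path` and `bytes` as well as the function names are hypothetical choices for this illustration.

```python
import json
import os
import queue
import tempfile


def run_mapper(task_id, records, out_dir, control):
    """Write map output to disk and announce only where it is stored."""
    path = os.path.join(out_dir, f"map-{task_id}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    # The control message stays tiny no matter how large the output file is.
    control.put({"task": task_id, "path": path, "bytes": os.path.getsize(path)})


def run_reducer(control, expected):
    """Collect pointers from the control channel, then read the data itself."""
    total = 0
    for _ in range(expected):
        msg = control.get()
        with open(msg["path"], encoding="utf-8") as fh:
            total += sum(json.load(fh))
    return total


if __name__ == "__main__":
    control = queue.Queue()  # stands in for the messaging layer
    with tempfile.TemporaryDirectory() as out_dir:
        run_mapper(0, [1, 2, 3], out_dir, control)
        run_mapper(1, [10, 20], out_dir, control)
        print(run_reducer(control, expected=2))
```

On a real cluster the pointer would have to be something the reducer can resolve from another machine, such as a URL or a path on a shared file system, which is why the answer mentions URLs or file handles rather than plain local paths.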