I have started to learn about Apache Spark and am very impressed by the framework. Although one thing which keeps bothering me is that in all Spark presentations they talk about how Spark caches the RDDs and therefore multiple operations which need the same data are faster than other approaches like Map Reduce.
So the question I had is that if this is the case, then just add a caching engine inside of MR frameworks like Yarn/Hadoop.
Why to create a new framework altogether?
I am sure I am missing something here and you will be able to point me to some documentation which educates me more on spark.
Caching + in memory computation is definitely a big thing for spark, However there are other things.
RDD(Resilient Distributed Data set): an RDD is the main abstraction of spark. It allows recovery of failed nodes by re-computation of the DAG while also supporting a more similar recovery style to Hadoop by way of checkpointing, to reduce the dependencies of an RDD. Storing a spark job in a DAG allows for lazy computation of RDD's and can also allow spark's optimization engine to schedule the flow in ways that make a big difference in performance.
Spark API: Hadoop MapReduce has a very strict API that doesn't allow for as much versatility. Since spark abstracts away many of the low level details it allows for more productivity. Also things like broadcast variables and accumulators are much more versatile than DistributedCache and counters IMO.
Spark Streaming: spark streaming is based on a paper Discretized Streams, which proposes a new model for doing windowed computations on streams using micro batches. Hadoop doesn't support anything like this.
As a product of in memory computation spark sort of acts as it's own flow scheduler. Whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows
The hadoop project is made up of MapReduce, YARN, commons and HDFS; spark however is attempting to create one unified big data platform with libraries (in the same repo) for machine learning, graph processing, streaming, multiple sql type libraries and I believe a deep learning library is in the beginning stages. While none of this is strictly a feature of spark it is a product of spark's computing model. Tachyon and BlinkDB are two other technologies that are built around spark.
a lot was covered by aaronman and samthebest. I have a few more points.
Spark has much lower per job and per task overhead. It gives it ability to be applied to the cases where Hadoop MR is not applicable. It is cases when reply is needed in 1-30 seconds.
Low per task overhead makes Spark more efficient for even big jobs with a lot of short tasks. As a very rough estimation - when task takes 1 second Spark will be 2 times more efficient then Hadoop MR.
Spark has lower abstraction then MR - it is graph of computations. As a result it is possible to implement more efficient processing then MR - specifically in cases when sorting is not needed. In other words - in MR we always pay for the sorting, but in Spark - we do not have to.
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a single MapReduce job, but rather a set of jobs in sequence.
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.
So its much more than just caching. Aaronman covered a lot so ill only add what he missed.
Raw performance w/o caching is 2-10x faster due to a generally more efficient and well archetected framework. E.g. 1 jvm per node with akka threads is better than forking a whole process for each task.
Scala API. Scala stands for Scalable Language and is clearly the best language to choose for parallel processing. They say Scala cuts down code by 2-5x, but in my experience from refactoring code in other languages - especially java mapreduce code, its more like 10-100x less code. Seriously I have refactored 100s of LOC from java into a handful of Scala / Spark. Its also much easier to read and reason about. Spark is even more concise and easy to use than the Hadoop abstraction tools like pig & hive, its even better than Scalding.
Spark has a repl / shell. The need for a compilation-deployment cycle in order to run simple jobs is eliminated. One can interactively play with data just like one uses Bash to poke around a system.
The last thing that comes to mind is ease of integration with Big Table DBs, like cassandra and hbase. In cass to read a table in order to do some analysis one just does
Similar things are expected for HBase. Now try doing that in any other MPP framework!!
UPDATE thought of pointing out this is just the advantages of Spark, there are quite a few useful things on top. E.g. GraphX for graph processing, MLLib for easy machine learning, Spark SQL for BI, BlinkDB for insane fast apprx queries, and as mentioned Spark Streaming
Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action. But Spark needs a lot of memory
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed.
Since Spark uses in-memory, there's no synchronisation barrier that's slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground-up to as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, but with richer optimizations under the hood that supports to the speed of spark.
Using RDDs Spark simplifies complex operations like join and groupBy and in the backend, you’re dealing with fragmented data. That fragmentation is what enables Spark to execute in parallel.
Spark allows to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Sparks speed.
Spark code base is much smaller.
Hope this helps.