Does anyone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?
I recently wrote an overview of the advantages/disadvantages of SparkR vs sparklyr, which may be of interest: https://eddjberry.netlify.com/post/2017-12-05-sparkr-vs-sparklyr/.
There's a table at the top of the post that gives a rough overview of the differences for a range of criteria.
I conclude that `sparklyr` is preferable to `SparkR`. The most notable advantage is its `dplyr` integration.
I can give you the highlights for sparklyr:
In the current 0.4 version, it does not yet support arbitrary parallel code execution. However, extensions can easily be written in Scala to overcome this limitation; see sparkhello.

As I don't see too many answers in favour of `SparkR`, I just want to mention that, as a newbie, I started learning them both, and I see that the `SparkR` API is more closely related to the one I use with standard `scala-spark`. As I study them both (I mean, I want to use RStudio and also Scala), I need to choose between `SparkR` and `sparklyr`. Learning `SparkR` together with the `scala-spark` API seems to be less effort than learning `sparklyr`, which is much more different, at least from my perspective. However, `sparklyr` appears more powerful. So for me it's a question of: do you want to use the more powerful and commonly used library with more support from the community, or do you compromise and use the API that is more similar to `scala-spark`? That, at least, is my perspective on choosing.

The biggest advantage of `SparkR` is the ability to run arbitrary user-defined functions written in R on Spark:
https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
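As an illustration of that API, here is a minimal sketch using `dapply()`, which runs an arbitrary R function over each partition of a Spark DataFrame (the built-in `faithful` dataset is used purely as an example, and a working Spark installation is assumed):

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# The output schema must be declared up front.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_hours", "double")
)

# The function below is ordinary R code and runs on the workers.
result <- dapply(df, function(part) {
  part$waiting_hours <- part$waiting / 60
  part
}, schema)

head(result)
```

`gapply()` and `spark.lapply()` follow the same pattern for grouped and list-based inputs, respectively.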
Since sparklyr translates R to SQL, you can only use a very small set of functions in `mutate` statements: http://spark.rstudio.com/dplyr.html#sql_translation
That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
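To illustrate, a small sketch of how the translation behaves (assuming a local Spark connection; `some_r_only_function` is a hypothetical placeholder for any R function without a SQL translation):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# Functions with a known SQL translation work inside mutate():
q <- mtcars_tbl %>% mutate(kpl = mpg * 0.425)
show_query(q)  # shows the generated Spark SQL

# An R-only function has no translation; dbplyr passes it through
# verbatim, so Spark errors only when the query is executed:
# mtcars_tbl %>% mutate(z = some_r_only_function(mpg))
```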
Other than that, sparklyr is a winner (in my opinion). Aside from the obvious advantage of using familiar `dplyr` functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.

Being a wrapper, there are some limitations to `sparklyr`. For example, using `copy_to()` to create a Spark DataFrame does not preserve columns formatted as dates. With `SparkR`, `as.DataFrame()` preserves dates.

... adding to the above from Javier ...
So far as I can find, sparklyr does not support `do()`, making it useful only for what's permitted by `mutate`, `summarise`, etc. Under the hood, sparklyr translates to Spark SQL, but doesn't (yet?) translate `do()` into something like a UDF.

Also, so far as I can find, sparklyr doesn't support tidyr, including `unnest()`.
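For completeness, a sketch of the kind of aggregation that does translate cleanly to Spark SQL, in contrast to `do()` or `tidyr::unnest()` (assuming a local Spark connection):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# group_by()/summarise() have direct Spark SQL translations,
# so the aggregation runs inside Spark; collect() brings the
# small result back into R.
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
```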