SparkR vs sparklyr [closed]

Published 2020-02-07 15:55

Does anyone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr syntax). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?

Best

7 Answers
smile是对你的礼貌 · 2020-02-07 16:20

I recently wrote an overview of the advantages/disadvantages of SparkR vs sparklyr, which may be of interest: https://eddjberry.netlify.com/post/2017-12-05-sparkr-vs-sparklyr/.

There's a table at the top of the post that gives a rough overview of the differences for a range of criteria.

I conclude that sparklyr is preferable to SparkR. The most notable advantages are:

  1. Better data manipulation through compatibility with dplyr
  2. Better function naming conventions
  3. Better tools for quickly evaluating ML models
  4. Easier to run arbitrary code on a Spark DataFrame
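To illustrate points 1 and 4, here's a minimal sketch of the sparklyr workflow (assumes a local Spark installation and the nycflights13 package; connection details will vary):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# dplyr verbs are translated to Spark SQL and executed on the cluster;
# collect() brings the result back into R
flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay)) %>%
  collect()

spark_disconnect(sc)
```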
迷人小祖宗 · 2020-02-07 16:22

I can give you the highlights for sparklyr:

As of the current 0.4 version, it does not yet support arbitrary parallel code execution. However, extensions can easily be written in Scala to overcome this limitation; see sparkhello.
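The extension mechanism builds on sparklyr's low-level JVM invocation API. A toy sketch of that wrapper-function pattern (assumes an open connection `sc`; the file path is a placeholder):

```r
library(sparklyr)

# Call methods on the underlying SparkContext JVM object directly;
# extensions like sparkhello wrap their own Scala/Java classes this way
count_lines <- function(sc, path) {
  spark_context(sc) %>%
    invoke("textFile", path, 1L) %>%
    invoke("count")
}

# count_lines(sc, "path/to/some/file.txt")
```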

Animai°情兽 · 2020-02-07 16:26

As I don't see too many answers in favour of SparkR, I just want to mention that, as a newbie, I started learning both, and I find that the SparkR API is more closely related to the one I use with standard scala-spark. Since I study both (meaning I want to use RStudio as well as Scala), I need to choose between SparkR and sparklyr. Learning SparkR alongside the scala-spark API seems to take less effort than learning sparklyr, which is much more different, at least from my perspective. However, sparklyr appears more powerful. So for me it comes down to: do you want the more powerful and commonly used library with more community support, or do you compromise and use the API that is more similar to scala-spark? That, at least, is my perspective on choosing.

够拽才男人 · 2020-02-07 16:27

The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:

https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
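For example, a sketch of SparkR's `dapply`, which applies an R function to each partition of a Spark DataFrame (requires a Spark session; function names follow the SparkR 2.x docs linked above):

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# The output schema must be declared up front: the input columns plus
# the new column computed by the R function
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_hours", "double"))

result <- dapply(df, function(part) {
  # `part` is a plain R data.frame holding one partition
  part$waiting_hours <- part$waiting / 60
  part
}, schema)

head(collect(result))
```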

Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:

http://spark.rstudio.com/dplyr.html#sql_translation
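For example (a sketch assuming a connection `sc` and a Spark table `flights_tbl`; the helper `my_scale` is made up for illustration):

```r
library(dplyr)

# Works: arithmetic and common aggregates have a Spark SQL translation
flights_tbl %>%
  mutate(speed = distance / air_time * 60)

# Fails: an arbitrary R function has no SQL translation, so Spark
# rejects it as an unknown function
my_scale <- function(x) (x - mean(x)) / sd(x)
# flights_tbl %>% mutate(z = my_scale(dep_delay))  # error
```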

That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).

Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.
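A sketch of that MLlib interface (assumes a connection `sc`; function names follow the sparklyr docs linked above, and may differ across versions):

```r
library(sparklyr)
library(dplyr)

mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# Split into training/test sets on the Spark side
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1099)

# Fit an MLlib linear regression via the ml_* wrappers
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

summary(fit)
```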

淡お忘 · 2020-02-07 16:28

Being a wrapper, sparklyr has some limitations. For example, using copy_to() to create a Spark DataFrame does not preserve columns formatted as dates. With SparkR, as.DataFrame() preserves dates.
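A quick way to check this for your own versions (assumes an active sparklyr connection `sc` and a SparkR session; the behavior described may differ in newer releases):

```r
df <- data.frame(d = as.Date(c("2017-01-01", "2017-06-15")))

# sparklyr: inspect the type that copy_to() produced
dates_tbl <- dplyr::copy_to(sc, df, "dates_tbl", overwrite = TRUE)
sparklyr::sdf_schema(dates_tbl)

# SparkR: as.DataFrame() keeps the date type
sdf <- SparkR::as.DataFrame(df)
SparkR::printSchema(sdf)
```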

放荡不羁爱自由 · 2020-02-07 16:28

... adding to the above from Javier...

As far as I can tell, sparklyr does not support do(), making it useful only when you want to do what's permitted by mutate, summarise, etc. Under the hood, sparklyr translates to Spark SQL, but it doesn't (yet?) translate do() to something like a UDF.

Also, as far as I can tell, sparklyr doesn't support tidyr, including unnest().
