Does anyone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?
I recently wrote an overview of the advantages/disadvantages of SparkR vs sparklyr, which may be of interest: https://eddjberry.netlify.com/post/2017-12-05-sparkr-vs-sparklyr/.
There's a table at the top of the post that gives a rough overview of the differences for a range of criteria.
I conclude that `sparklyr` is preferable to `SparkR`. The most notable advantage is its `dplyr` integration.
I can give you the highlights for sparklyr:
In the current 0.4 version, it does not yet support arbitrary parallel code execution. However, extensions can easily be written in Scala to overcome this limitation; see sparkhello.

As I don't see too many answers in favour of `SparkR`, I just want to mention that, as a newbie, I started learning them both, and I see that the `SparkR` API is more closely related to the one I use with standard `scala-spark`. As I study them both (I mean, I want to use RStudio and also Scala), I need to choose between `SparkR` and `sparklyr`. Learning `SparkR` together with the `scala-spark` API seems to be less effort than learning `sparklyr`, which is much more different, at least from my perspective. However, `sparklyr` appears more powerful. So for me it's a question of: do you want to use the more powerful and commonly used library with more support from the community, or do you compromise and use the API that is more similar to `scala-spark`? That, at least, is my perspective on choosing.

The biggest advantage of `SparkR` is the ability to run arbitrary user-defined functions written in R on Spark:
https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
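As an illustration of that API, here is a minimal sketch using `dapply()`, which runs an arbitrary R function over each partition of a Spark DataFrame (the built-in `faithful` dataset is used purely as an example, and a working Spark installation is assumed):

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# The output schema must be declared up front.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_hours", "double")
)

# The function below is ordinary R code and runs on the workers.
result <- dapply(df, function(part) {
  part$waiting_hours <- part$waiting / 60
  part
}, schema)

head(result)
```

`gapply()` and `spark.lapply()` follow the same pattern for grouped and list-based inputs, respectively.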
Since sparklyr translates R to SQL, you can only use a very small set of functions in `mutate` statements: http://spark.rstudio.com/dplyr.html#sql_translation
That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
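To illustrate, a small sketch of how the translation behaves (assuming a local Spark connection; `some_r_only_function` is a hypothetical placeholder for any R function without a SQL translation):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# Functions with a known SQL translation work inside mutate():
q <- mtcars_tbl %>% mutate(kpl = mpg * 0.425)
show_query(q)  # shows the generated Spark SQL

# An R-only function has no translation; dbplyr passes it through
# verbatim, so Spark errors only when the query is executed:
# mtcars_tbl %>% mutate(z = some_r_only_function(mpg))
```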
Other than that, sparklyr is a winner (in my opinion). Aside from the obvious advantage of using familiar `dplyr` functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.

Being a wrapper, there are some limitations to `sparklyr`. For example, using `copy_to()` to create a Spark DataFrame does not preserve columns formatted as dates. With `SparkR`, `as.DataFrame()` preserves dates.

... adding to the above from Javier ...
So far as I can find, sparklyr does not support `do()`, making it useful only for what's permitted by `mutate`, `summarise`, etc. Under the hood, sparklyr translates to Spark SQL, but doesn't (yet?) translate `do()` into something like a UDF.

Also, so far as I can find, sparklyr doesn't support tidyr, including `unnest()`.
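For completeness, a sketch of the kind of aggregation that does translate cleanly to Spark SQL, in contrast to `do()` or `tidyr::unnest()` (assuming a local Spark connection):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# group_by()/summarise() have direct Spark SQL translations,
# so the aggregation runs inside Spark; collect() brings the
# small result back into R.
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
```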