SparkR vs sparklyr [closed]

Posted 2020-02-07 15:55

Does someone have an overview of the advantages and disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Having tried both, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr syntax). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?

Best

7 Answers

淡お忘
#2 · 2020-02-07 16:36

For an overview and in-depth details, you may refer to the documentation. Quoting from it, "the sparklyr package provides a complete dplyr backend". This means that sparklyr is not a replacement for Apache Spark but an extension to it.
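
As a rough illustration of what that dplyr backend looks like in practice (the nycflights13 data and the table name "flights" are my own example, not taken from the documentation):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# copy a local data frame into Spark and query it with ordinary dplyr verbs
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

flights_tbl %>%
  filter(dep_delay > 60) %>%
  count(carrier) %>%
  collect()              # bring the aggregated result back into R as a tibble

spark_disconnect(sc)

The dplyr verbs are translated to Spark SQL and executed inside Spark; only collect() pulls the result back into your R session.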

Continuing further, regarding its installation (I'm a Windows user) on a standalone computer: you would either need to download and install the new RStudio Preview version, or else execute the following series of commands in the RStudio console:

devtools::install_github("rstudio/sparklyr")

Install the readr and digest packages if you do not have them installed, then install Spark itself:

install.packages("readr")
install.packages("digest")
library(sparklyr)
spark_install(version = "1.6.2")
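
As an optional sanity check (not a required step), sparklyr can list the Spark versions it has installed locally and the ones it can download:

spark_installed_versions()   # Spark/Hadoop versions already installed locally
spark_available_versions()   # versions that spark_install() can download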

Once the packages are installed, try connecting to a local instance of Spark using the command:

sc <- spark_connect(master = "local")

You may see an error such as

Created default hadoop bin directory under: C:\spark-1.6.2\tmp\hadoop Error:

To run Spark on Windows you need a copy of Hadoop winutils.exe:

  1. Download Hadoop winutils.exe from
  2. Copy winutils.exe to C:\spark-1.6.2\tmp\hadoop\bin

Alternatively, if you are using RStudio you can install the RStudio Preview Release which includes an embedded copy of Hadoop winutils.exe.

The error resolution is given to you. Head over to the GitHub account, download the winutils.exe file, save it to C:\spark-1.6.2\tmp\hadoop\bin, and try creating the Spark context again. Last year I published a comprehensive post on my blog detailing how to install and work with SparkR in a Windows environment.
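
As a small sketch of that retry (the path below simply mirrors the one from the error message; adjust it to your Spark version):

library(sparklyr)
winutils <- "C:\\spark-1.6.2\\tmp\\hadoop\\bin\\winutils.exe"
if (file.exists(winutils)) {              # confirm winutils.exe is in place
  sc <- spark_connect(master = "local")   # retry creating the Spark context
  spark_disconnect(sc)
} else {
  message("winutils.exe not found - copy it into the hadoop\\bin folder first")
}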

Having said that, I would recommend not going down this painful path of installing a local instance of Spark in the regular RStudio; instead, try the RStudio Preview version. It will save you much of the hassle of creating the Spark context. Continuing further, here is a detailed post on R-bloggers on how sparklyr can be used.

I hope this helps.

Cheers.
