SparkR: Cannot read data at deployed workers, but reading locally works

Posted 2019-07-25 04:13

Question:

I'm new to Spark and SparkR. For Hadoop, I only have a single file, winutils/bin/winutils.exe.
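(For context: winutils on Windows is usually wired up through HADOOP_HOME; a minimal sketch of that setup, assuming d:/winutils is the directory that contains bin/winutils.exe:)

Sys.setenv(HADOOP_HOME = "d:/winutils")  # Spark looks for %HADOOP_HOME%\bin\winutils.exe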

Running system:

  • OS
    • Windows 10
  • Java
    • Java version "1.8.0_101"
    • Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
    • Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
  • R
    • platform: x86_64-w64-mingw32
    • arch: x86_64
    • os: mingw32
  • RStudio:
    • Version 1.0.20 – © 2009-2016 RStudio, Inc.
  • Spark
    • 2.0.0

I can read data on my local machine, but not on the deployed workers.

Could anybody help me?

Running locally:

Sys.setenv(SPARK_HOME = "D:/SPARK2")
library(SparkR)
# sparkPackages is an argument to sparkR.session() itself, not a sparkConfig entry
sparkR.session(master = "local[*]", enableHiveSupport = FALSE,
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.sql.warehouse.dir = "d:/winutils/bin"),
               sparkPackages = "com.databricks:spark-avro_2.11:3.0.1")

Java ref type org.apache.spark.sql.SparkSession id 1

multiPeople <- read.json(c(paste(getwd(), "people.json", sep = "/"),
                           "D:/RwizSpark_Private/people2.json"))

rand_10m_x <- read.df(x = "./demo.csv", source = "csv", inferSchema = "true", na.strings = "NA")
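Both reads succeed locally; a quick sanity check (standard SparkR calls, shown here only for illustration):

head(multiPeople)   # prints the first parsed JSON rows
nrow(rand_10m_x)    # counts rows in the CSV-backed SparkDataFrame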

Running on the deployed workers:

Sys.setenv(SPARK_HOME = "D:/SPARK2")
library(SparkR)
sparkR.session(master = "spark://mymasterIP", enableHiveSupport = FALSE, appName = "sparkRenzhi",
               sparkConfig = list(spark.driver.memory = "6g",
                                  spark.sql.warehouse.dir = "d:/winutils/bin",
                                  spark.executor.memory = "2g",
                                  spark.executor.cores = "2"),
               sparkPackages = "com.databricks:spark-avro_2.11:3.0.1")

Java ref type org.apache.spark.sql.SparkSession id 1

multiPeople <- read.json(c(paste(getwd(), "people.json", sep = "/"),
                           "D:/RwizSpark_Private/people2.json"))

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0
  failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.29.110.101):
  java.io.FileNotFoundException: File file:/D:/RwizSpark_Private/people2.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapre

rand_10m_x <- read.df(x = "./demo.csv", source = "csv", inferSchema = "true", na.strings = "NA")

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0
  failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 11, 172.29.110.101):
  java.io.FileNotFoundException: File file:/D:/RwizSpark_Private/demo.csv does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.T
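From the traces, each executor resolves file:/D:/RwizSpark_Private/... against its own local disk, so the file has to exist at that exact path on every worker. A sketch of the usual workarounds (the hdfs://mymasterIP:9000/data location below is hypothetical):

# Option 1: put the data on storage every node can reach, e.g. HDFS
# (hdfs://mymasterIP:9000/data is a hypothetical URI, not from my setup)
multiPeople <- read.json("hdfs://mymasterIP:9000/data/people2.json")

# Option 2: copy the file to the identical local path on every worker,
# then read it with an explicit file: URI
multiPeople <- read.json("file:///D:/RwizSpark_Private/people2.json")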