sparklyr: can I pass format and path options into spark_write_table, or saveAsTable into spark_write_orc?

Posted 2019-07-20 18:18

Question:

Spark 2.0 with Hive

Let's say I am trying to write a Spark DataFrame, irisDf, to ORC and save it to the Hive metastore.

In Spark I would do that like this,

irisDf.write.format("orc")
    .mode("overwrite")
    .option("path", "s3://my_bucket/iris/")
    .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function,

data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite'
)

But this doesn't allow me to set a path or format.

I can also use spark_write_orc

spark_write_orc(
  iris_spark,
  path = "s3://my_bucket/iris/",
  mode = "overwrite"
)

but it doesn't have a saveAsTable option.

Now, I CAN use invoke statements to replicate the Spark code,

sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")
writer %>%
  invoke("format", "orc") %>%
  invoke("mode", "overwrite") %>%
  invoke("option", "path", "s3://my_bucket/iris/") %>%
  invoke("saveAsTable", "my_database.iris")

But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?

Answer 1:

The path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:

spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite', 
  options = list(path = "s3a://my_bucket/iris/")
)

By default in Spark, this will create a table stored as Parquet at the given path (partition subdirectories can be specified with the partition_by argument).
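For example, a partitioned write might look like this; a sketch assuming the iris_spark tbl from the question and using the Species column purely for illustration:

```r
# Sketch: same spark_write_table call, but partitioning the table
# on disk by Species (creates Species=... subdirectories under path).
spark_write_table(
  iris_spark,
  name = "my_database.iris",
  mode = "overwrite",
  options = list(path = "s3a://my_bucket/iris/"),
  partition_by = "Species"
)
```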

As of today there is no corresponding option for format, but an easy workaround is to set the spark.sessionState.conf.defaultDataSourceName property, either at runtime

spark_session_config(
  sc, "spark.sessionState.conf.defaultDataSourceName", "orc"
)

or when you create a session.
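To set it at session creation, the property can be added to the config passed to spark_connect; a sketch, assuming the same property key is honored when supplied through spark_config() (the master value is illustrative):

```r
library(sparklyr)

# Sketch: set the default data source before connecting,
# so saveAsTable-style writes default to ORC.
conf <- spark_config()
conf[["spark.sessionState.conf.defaultDataSourceName"]] <- "orc"

sc <- spark_connect(master = "yarn-client", config = conf)
```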