Error using spark 'save' does not support

Posted 2020-06-21 10:46

I have a DataFrame that I am trying to partition by a column, sort by that same column, and save in Parquet format using the following command:

df.write().format("parquet")
  .partitionBy("dynamic_col")
  .sortBy("dynamic_col")
  .save("test.parquet");

I get the following error:

reason: User class threw exception: org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;

Is save(...) not allowed here? Is only saveAsTable(...) allowed, which saves the data to Hive?

Any suggestions are helpful.

2 Answers
神经病院院长 · 2020-06-21 11:25

The problem is that sortBy is currently (as of Spark 2.3.1) supported only together with bucketing, bucketing has to be used in combination with saveAsTable, and the bucket sorting column must not be part of the partition columns.

So you have two options:

  1. Do not use sortBy:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .option("path", output_path)
    .save()
    
  2. Use sortBy with bucketing and save it through the metastore using saveAsTable:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .bucketBy(n, bucket_col)
    .sortBy(bucket_col)
    .option("path", output_path)
    .saveAsTable(table_name)
    
老娘就宠你 · 2020-06-21 11:26

Try

df.repartition("dynamic_col").write.partitionBy("dynamic_col").parquet("test.parquet")
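
A note on why this works: repartition("dynamic_col") shuffles all rows with the same key into the same task, so each partition directory typically gets a single output file. If the original intent was also to have rows sorted inside each file, sortWithinPartitions can be chained in as well. A minimal sketch, assuming an existing SparkSession and a DataFrame named df (names are illustrative, not from the original post):

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so each dynamic_col value lands in one task,
// sort rows within each task, then write one file per partition directory.
// sortWithinPartitions sorts locally per partition, avoiding the
// bucketing restriction that sortBy triggers on save().
df.repartition(col("dynamic_col"))
  .sortWithinPartitions("dynamic_col")
  .write
  .partitionBy("dynamic_col")
  .parquet("test.parquet")
```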