Spark partitionBy much slower than without it

Published 2020-06-04 03:23

祖国的老花朵

Question:

I tested writing with:

 import org.apache.spark.sql.SaveMode

 df.write.partitionBy("id", "name")
    .mode(SaveMode.Append)
    .parquet(filePath)

However, if I leave out the partitioning:

 df.write
    .mode(SaveMode.Append)
    .parquet(filePath)

It executes 100x(!) faster.

Is it normal for the same amount of data to take 100x longer to write when partitioning?

The id column has 10 unique values and the name column has 3000. The DataFrame has 10 additional integer columns.

Answer 1:

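The first code snippet writes at least one Parquet file per partition directory (id=.../name=...) to the file system (local or HDFS). With 10 distinct ids and 3000 distinct names, that is up to 10 × 3000 = 30000 directories if every combination occurs. Each write task also emits its own file for every combination it holds, so the actual file count can be a multiple of that. I suspect that the overhead of creating the files, writing the Parquet metadata, etc. is quite large (in addition to any sorting or shuffling of the data).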
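
A commonly used mitigation, which goes beyond what this answer states, is to cluster the rows by the partition columns before writing, so that each (id, name) combination is handled by one task and produces one file instead of one per task. The following is only a minimal sketch under assumptions: the object name, the toy data, the output path, and the helper `worstCaseFiles` are hypothetical, and the extra `repartition` itself costs a shuffle, so whether it helps depends on the data.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionedWriteSketch {
  // Upper bound on output files without clustering: each write task
  // emits one file per distinct (id, name) combination it holds.
  def worstCaseFiles(writeTasks: Int, distinctCombos: Int): Long =
    writeTasks.toLong * distinctCombos

  // Hash-repartition by the partition columns first, so all rows of a
  // given (id, name) combination end up in the same write task.
  def writeClustered(df: DataFrame, path: String): Unit =
    df.repartition(col("id"), col("name"))
      .write
      .partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-write-sketch")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data standing in for the question's DataFrame.
    val df = Seq((1, "a", 10), (1, "b", 20), (2, "a", 30))
      .toDF("id", "name", "value")

    // Hypothetical output path.
    writeClustered(df, "/tmp/partitioned-write-sketch")

    // Worst case for 200 write tasks and 30000 combinations.
    println(worstCaseFiles(200, 30000))

    spark.stop()
  }
}
```

Even with clustering, 30000 small files remain expensive to create and to read back, so partitioning on a column with 3000 distinct values may still be worth reconsidering.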

Spark is not the best database engine. If your dataset fits in memory, I suggest using a relational database instead; it will be faster and easier to work with.