Spark partitionBy much slower than without it

Posted 2020-06-04 03:23

Question:

I tested writing with:

    import org.apache.spark.sql.SaveMode

    df.write
      .partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet(filePath)

However, if I leave out the partitioning:

    df.write
      .mode(SaveMode.Append)
      .parquet(filePath)

It executes 100x(!) faster.

Is it normal for the same amount of data to take 100x longer to write when partitioning?

The id column has 10 unique values and the name column has 3000. The DataFrame also has 10 additional integer columns.
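For reference, the cardinalities can be checked with something like this (a sketch, not part of the timed writes):

    import org.apache.spark.sql.functions.countDistinct

    // Count the distinct values in the two partition columns.
    df.select(countDistinct("id"), countDistinct("name")).show()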

Answer 1:

The first code snippet writes a Parquet file per partition to the file system (local or HDFS). With 10 distinct ids and 3000 distinct names, this creates 30000 partition directories, each holding at least one file; since every task writes its own file for each partition key it holds, the actual file count can be even higher. I suspect the overhead of creating all those files, writing Parquet metadata, and so on is quite large (in addition to any shuffling).
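If the partitioned layout is needed, one common mitigation (a sketch, assuming the same df and filePath as in the question) is to repartition by the partition columns before writing, so all rows for a given (id, name) combination are handled by a single task:

    // Shuffle so all rows for a given (id, name) land in one task,
    // which usually yields one output file per partition directory.
    df.repartition(df("id"), df("name"))
      .write
      .partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet(filePath)

This trades an explicit shuffle for a far smaller number of output files.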

Spark is not a database engine; if your dataset fits in memory, I suggest using a relational database instead. It will be faster and easier to work with.
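If you go that route, one way to move the data out of Spark is the built-in JDBC sink (a hedged sketch; the URL, table name, and credentials are placeholders, not details from the post):

    // Append the DataFrame to a relational table over JDBC.
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "events")
      .option("user", "user")
      .option("password", "password")
      .mode(SaveMode.Append)
      .save()

Note that the appropriate JDBC driver must be on the Spark classpath.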