pyspark: Performance difference for spark.read.for

Posted 2019-08-27 19:02

Question:

Does anyone know what the difference is between spark.read.format("csv") and spark.read.csv?

Some say "spark.read.csv" is an alias of "spark.read.format("csv")", but I saw a difference between the two. I ran an experiment, executing each command below in a fresh pyspark session so that nothing was cached.

DF1 took 42 secs while DF2 took just 10 secs. The csv file is 60+ GB.

DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")

DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")

The reason I dug into this is that I need to union two DataFrames after filtering and then write the result back to HDFS, and the write takes an extremely long time (still writing after 16 hours).

Answer 1:

Basically, the two calls are equivalent. The difference is in how you used them in your two implementations.

With DF1 you set the inferSchema option, which slows the read down. That explains why DF1 took longer than DF2.

inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default (see the Spark documentation for details).
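To compare the two call styles fairly, both have to be given the same options. The sketch below is only an illustration of that idea: the helper names (read_via_format, read_via_csv, read_with_schema) are hypothetical and not part of Spark, and each helper expects an already created SparkSession passed in as `spark`. The third helper shows one way to keep typed columns without the extra inference pass, by supplying the schema yourself as a DDL string; the column names in its comment are made up.

```python
def read_via_format(spark, path, infer_schema=False):
    """Read a CSV using the generic format("csv") ... load() style."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        # Options are passed as strings here: "true" or "false".
        .option("inferSchema", str(infer_schema).lower())
        .load(path)
    )


def read_via_csv(spark, path, infer_schema=False):
    """Read the same CSV using the csv() shortcut with identical options."""
    return spark.read.csv(path, header=True, inferSchema=infer_schema)


def read_with_schema(spark, path, ddl_schema):
    """Read a CSV with an explicit schema, so no inference pass is needed.

    ddl_schema is a DDL-formatted string such as "id INT, name STRING"
    (hypothetical columns; replace with the real ones for your file).
    """
    return spark.read.schema(ddl_schema).option("header", "true").csv(path)
```

With the same infer_schema value, read_via_format and read_via_csv request the same read, so any remaining timing gap should come from something other than the call style. If the column types are known in advance, read_with_schema is likely the cheaper choice on a 60+ GB file, since it avoids the extra pass that inferSchema needs.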