pyspark: Performance difference between spark.read.format("csv") and spark.read.csv

Posted 2019-08-27 19:02


Does anyone know the difference between spark.read.format("csv") and spark.read.csv?

Some say spark.read.csv is an alias of spark.read.format("csv"), but I saw a difference between the two. I ran an experiment, executing each command below in a new pyspark session so that nothing was cached.

DF1 took 42 secs while DF2 took just 10 secs. The csv file is 60+ GB.

DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")

DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")

The reason I dug into this is that I need to union two dataframes after filtering and then write the result back to HDFS, and the write took a very long time (it was still writing after 16 hours).


Basically, the two calls are equivalent: you get the same result whichever one you use. The difference is in how you invoked them.

With DF1 you added the inferSchema option, which slows the read down. That explains why DF1 took longer than DF2.
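To compare the two call styles fairly, both need the same options. Here is a minimal sketch of how that could look. The helper names read_via_format and read_via_csv are made up for illustration, and spark is assumed to be an existing SparkSession (as in the pyspark shell). Both helpers only use DataFrameReader calls (format, option, load, csv).

```python
def read_via_format(spark, path, infer_schema=False):
    """Long form: format("csv") followed by load()."""
    return (
        spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", infer_schema)
        .load(path)
    )


def read_via_csv(spark, path, infer_schema=False):
    """Shorthand: csv() with the same options passed as keyword arguments."""
    return spark.read.csv(path, header=True, inferSchema=infer_schema)


# Hypothetical usage; the path is the one from the question:
# path = "hdfs://bda-ns/user/project/xxx.csv"
# df_a = read_via_format(spark, path, infer_schema=True)
# df_b = read_via_csv(spark, path, infer_schema=True)
```

With infer_schema set to the same value in both helpers, any remaining timing gap should come from something other than the call style.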

inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default. See the Spark documentation for details.
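To see why inferring types costs an extra pass, here is a toy illustration in plain Python. It is not how Spark implements inferSchema; the function names and the sample data are made up. The point is only that the types cannot be known until every row has been looked at, so the data ends up being read twice.

```python
import csv
import io


def read_as_strings(text):
    """One pass: keep every column as a string (no inference)."""
    return list(csv.DictReader(io.StringIO(text)))


def infer_type(values):
    """Pick the narrowest of int, float, str that fits all values."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def read_with_inference(text):
    """Two passes: one to decide the types, one to load the rows."""
    # Pass 1: scan every row only to collect the values of each column.
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    casters = {name: infer_type(vals) for name, vals in columns.items()}

    # Pass 2: read the data again, casting with the inferred types.
    return [
        {name: casters[name](value) for name, value in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


sample = "id,price,name\n1,2.5,a\n2,3,b\n"
typed_rows = read_with_inference(sample)
string_rows = read_as_strings(sample)
```

On a 60+ GB file, that first scan is where the extra time would go.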

Tags: csv pyspark