Does anyone know the difference between spark.read.format("csv") and spark.read.csv?
Some say spark.read.csv is an alias of spark.read.format("csv"), but I saw a difference between the two. As an experiment, I ran each of the commands below in a new PySpark session so that nothing was cached.
DF1 took 42 seconds while DF2 took just 10 seconds. The CSV file is over 60 GB.
DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")
DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")
The reason I dug into this is that I need to union two DataFrames after filtering them and then write the result back to HDFS, and the write is taking an extremely long time (still writing after 16 hours).
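One thing to note about the two read commands above: DF1 sets inferSchema to true while DF2 does not, so the two timings are not a like-for-like comparison. As far as I understand, inferSchema makes Spark scan the data at load time to work out column types, which may account for some or all of the gap. Below is a minimal sketch of a fairer test that passes identical options to both readers. The helper names timed and compare_readers are hypothetical, and spark is assumed to be an existing SparkSession.

```python
import time


def timed(label, fn):
    """Run fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


def compare_readers(spark, path, infer_schema=True):
    """Load the same CSV through both APIs with identical options.

    `spark` is an existing SparkSession; `path` is the CSV location.
    Both names are placeholders for this sketch.
    """
    opts = {"header": "true", "inferSchema": str(infer_schema).lower()}
    df1, t1 = timed(
        'format("csv").load',
        lambda: spark.read.format("csv").options(**opts).load(path),
    )
    df2, t2 = timed(
        "csv",
        lambda: spark.read.options(**opts).csv(path),
    )
    # With identical options, both readers are expected to yield the same schema.
    return {"same_schema": df1.schema == df2.schema, "load_s": t1, "csv_s": t2}
```

Calling compare_readers(spark, "hdfs://bda-ns/user/project/xxx.csv") would run both loads with matching options. Because the second load in the same session may benefit from file system caching, it is probably still worth repeating each load in its own fresh session, as in the experiment above.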