spark access first n rows - take vs limit

Posted 2019-04-06 10:54

Question:

I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever? I do not want the first 100 records of each partition, just any 100 records.
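
For contrast, the take-based version that returns almost immediately (a minimal sketch, using the same df as above):

// take(100) collects at most 100 rows to the driver. Spark runs an
// incremental job that scans only as many partitions as it needs to
// fill the request, so it never makes a full pass over the data.
val first100: Array[org.apache.spark.sql.Row] = df.take(100)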

Answer 1:

This is because limit pushdown is currently not supported in Spark: the limit is not pushed down into the data source, so Spark still plans a full scan followed by a single-partition shuffle to enforce the global limit; see this very good answer.
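
One way to see this, assuming the same df as in the question, is to print the physical plan and look for the single-partition exchange that the global limit forces before any writing starts:

df.limit(100).repartition(1).explain()
// Expect a plan of roughly this shape (details vary by Spark version):
// Exchange RoundRobinPartitioning(1)
// +- GlobalLimit 100
//    +- Exchange SinglePartition
//       +- LocalLimit 100
//          +- FileScan ...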

Actually, take(n) should take a really long time as well. I just tested it, however, and got the same results you did: take is almost instantaneous regardless of dataset size, while limit takes a long time.
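
A workaround consistent with these observations (a sketch, assuming a running SparkSession named spark and the questioner's df; "myPath" is the placeholder path from the question) is to collect the rows with take and rebuild a small DataFrame on the driver before writing:

import org.apache.spark.sql.{Row, SaveMode}

// Any 100 rows, fetched via take's fast incremental job.
val rows: Array[Row] = df.take(100)

// Rebuild a tiny DataFrame from the collected rows, reusing df's schema.
val smallDf = spark.createDataFrame(
  spark.sparkContext.parallelize(rows.toSeq),
  df.schema
)

// Writing 100 local rows is cheap, unlike limit + write on the full data.
smallDf
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")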