I want to query my database and store only the top 1 million records sorted by some column say column1. I have tried two approaches
I load parquet files into a dataframe from the HDFS and apply SQL Query to it and then I save the the complete dataframe(10 million records) as text files on HDFS.
df = sqlContext.sql("SELECT * FROM table order by column1") df.rdd.saveAsTextFile("<path>")
I then read the text files and fetch 1 million records from the text file.
I limit the SQL query to 1 million records.
df = sqlContext.sql("SELECT * FROM table order by column1 LIMIT 1000000") df.rdd.saveAsTextFile("<path>")
But the second approach is much slower. I found that in second case dataframe returned by SQL query(df) contains only 1 partition and thus it is written in a single task. Repartitioning the dataframe improved the performance in second case but it was still slower than first case.
Please can anybody suggest an approach to save the dataframe faster in case 2 or any other approach to achieve the same task