I'm using PySpark to do some data cleaning. A very common operation is to take a small-ish, ordered subset of a file and export it for inspection:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered returns a plain Python list instead of an RDD, so chaining saveAsTextFile onto it fails:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer, or convert the list back into an RDD with parallelize. But I'm trying to be a Spark purist here.
Isn't there any way to return an RDD from takeOrdered or an equivalent function?