I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:
(self.spark_context.textFile(old_filepath + filename)
    .takeOrdered(100)
    .saveAsTextFile(new_filepath + filename))
My problem is that takeOrdered returns a list instead of an RDD, so saveAsTextFile fails with:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a Spark purist here.
Isn't there any way to return an RDD from takeOrdered or an equivalent function?
takeOrdered() is an action, not a transformation, so you can't have it return an RDD. If ordering isn't necessary, the simplest alternative would be sample(). If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().