Return an RDD from takeOrdered, instead of a list

Posted 2019-04-10 10:42

Question:

I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:

(self.spark_context.textFile(old_filepath+filename)
    .takeOrdered(100) 
    .saveAsTextFile(new_filepath+filename))

My problem is that takeOrdered is returning a list instead of an RDD, so saveAsTextFile doesn't work.

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here.

Isn't there any way to return an RDD from takeOrdered or an equivalent function?

Answer 1:

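takeOrdered() is an action, not a transformation: it collects its result on the driver as a plain Python list, so you can't have it return an RDD.
If ordering isn't necessary, the simplest alternative would be sample(), which is a transformation and does return an RDD. Note that it takes a fraction rather than an exact count, so the size of the subset is only approximate.
If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them while staying inside the RDD API. Or, as you suggested, re-parallelize the result of takeOrdered().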
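
For illustration, here is a rough sketch of the two ordered options. The helper names (take_ordered_as_rdd, first_n_sorted) and the num_slices parameter are made up for this example and are not part of Spark; sc is assumed to be an existing SparkContext and rdd an RDD of sortable elements such as the lines from textFile(). The second helper uses sortBy() with zipWithIndex() in place of sortByKey(), which is one possible way to keep the first n elements without leaving the RDD API.

```python
def take_ordered_as_rdd(sc, rdd, n, num_slices=1):
    """Run takeOrdered() on the driver, then turn the list back into an RDD.

    Hypothetical helper: only n elements are collected on the driver,
    so this is cheap when n is small (e.g. 100).
    """
    smallest = rdd.takeOrdered(n)               # action: returns a Python list
    return sc.parallelize(smallest, num_slices)  # back to an RDD


def first_n_sorted(rdd, n):
    """Keep the n smallest elements using transformations only.

    Hypothetical helper: sortBy() sorts the whole dataset, which is
    more expensive than takeOrdered() but never materialises a list
    on the driver.
    """
    return (rdd.sortBy(lambda x: x)               # global ascending sort
               .zipWithIndex()                    # (element, position) pairs
               .filter(lambda pair: pair[1] < n)  # keep the first n positions
               .keys())                           # drop the positions again
```

With either helper, the pipeline from the question could then end in saveAsTextFile(), for example take_ordered_as_rdd(sc, sc.textFile(old_filepath+filename), 100).saveAsTextFile(new_filepath+filename). For a small n, re-parallelizing is probably the cheaper choice, since the "pure" version sorts the whole file just to keep 100 lines.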