I want to read an RDD into an array. For that I could use the collect method, but in my case it keeps throwing Kryo buffer overflow errors, and if I set the Kryo buffer size too large, that causes problems of its own. On the other hand, I have noticed that if I just save the RDD to a file with saveAsTextFile, I get no errors. So I was wondering: is there a better way to read an RDD into an array that isn't as problematic as collect?
Answer 1:
No. collect is the only method for reading an RDD into an array. saveAsTextFile never has to collect all the data to one machine, so it is not limited by the available memory on a single machine in the same way that collect is.
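This suggests a workaround: saveAsTextFile writes one part-* file per partition into an output directory, and those files can then be read back locally without ever holding the whole dataset in the driver at once during the Spark job. A minimal sketch of reading such a directory back into an array (pure Python; the directory layout is simulated here, since actually producing it would require a running Spark cluster):

```python
import os
import tempfile

# Simulate a saveAsTextFile output directory: one part-* file per partition.
out_dir = tempfile.mkdtemp()
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
for i, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part) + "\n")

# Read the part files back into a single array, in partition order.
elements = []
for name in sorted(os.listdir(out_dir)):
    if name.startswith("part-"):
        with open(os.path.join(out_dir, name)) as f:
            elements.extend(line.rstrip("\n") for line in f)

print(elements)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Note that this only moves the memory pressure from the Spark driver to whatever process reads the files back; if the full dataset does not fit in that process's memory either, you still need to process the lines incrementally.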
Answer 2:
toLocalIterator()
This method returns an iterator over all of the elements in the RDD. The iterator consumes only as much memory as the largest partition, because it runs a separate job to evaluate one partition at a time.
>>> x = rdd.toLocalIterator()
>>> x
<generator object toLocalIterator at 0x283cf00>
Then you can access the elements of the RDD like this:
empty_array = []
for each_element in x:
    empty_array.append(each_element)
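The memory behavior can be pictured with a plain-Python generator (a sketch of the idea, not Spark's actual implementation; local_iterator and the partition list are made up for illustration):

```python
def local_iterator(partitions):
    """Yield elements partition by partition, like RDD.toLocalIterator().

    In real Spark, each partition is fetched by a separate job, so only one
    partition's worth of data is in driver memory at a time.
    """
    for part in partitions:
        for element in part:
            yield element

parts = [[1, 2], [3, 4, 5], [6]]
it = local_iterator(parts)
result = list(it)  # materializing defeats the memory savings, shown for clarity
print(result)  # [1, 2, 3, 4, 5, 6]
```

The trade-off: if you append every element into a list as above, you end up holding the whole dataset in memory anyway, just as collect would; toLocalIterator only helps when you process elements as they stream past instead of accumulating them.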
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#toLocalIterator()