We know that if we need to convert RDD to a list, then we should use collect(). but this function puts a lot of stress on the driver (as it brings all the data from different executors to the driver) which causes performance degradation or worse (whole application may fail).
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
Basically in current scenario where we deal with huge amount of data in batch or stream data processing, APIs like collect() and collectAsMap() has become completely useless in a real project with real amount of data. We can use it in demo code, but that's all there to use for these APIs. So why to have an API which we can not even use (Or am I missing something).
Can there be a better way to achieve the same result through some other method or can we implement collect() and collectAsMap() in a more effective way other that just calling
List<String> myList= RDD.collect.toList
(which effects performance)
I looked up to google but could not find anything which can be effective. Please help if someone has got a better approach.
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
No, and there can't be. And if there were such a way, collect
would be implemented using it in the first place.
Well, technically you could implement List
interface on top of RDD
(or most of it?), but that would be a bad idea and quite pointless.
So why to have an API which we can not even use (Or am I missing something).
collect
is intended to be used for cases where only large RDDs are inputs or intermediate results, and the output is small enough. If that's not your case, use foreach
or other actions instead.
As you want to collect the Data in a Java Collection, the data has to collect on single JVM as the java collections won't be distributed. There is no way to get all data in collection by not getting data. The interpretation of problem space is wrong.
collect
and similar are not meant to be used in normal spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
You need to keep your data inside of the rdd, and use rdd transformations and actions without ever taking the data out. Methods like collect
which pull you data out of spark and onto your driver defeat the purpose and undo any advantage that spark might be providing since now you're processing all of your data on a single machine anyway.