how to convert rdd to list effectively without usi

2020-04-14 01:40发布

问题:

We know that if we need to convert RDD to a list, then we should use collect(). but this function puts a lot of stress on the driver (as it brings all the data from different executors to the driver) which causes performance degradation or worse (whole application may fail).

Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?

Basically in current scenario where we deal with huge amount of data in batch or stream data processing, APIs like collect() and collectAsMap() has become completely useless in a real project with real amount of data. We can use it in demo code, but that's all there to use for these APIs. So why to have an API which we can not even use (Or am I missing something).

Can there be a better way to achieve the same result through some other method or can we implement collect() and collectAsMap() in a more effective way other that just calling

List<String> myList= RDD.collect.toList (which effects performance)

I looked up to google but could not find anything which can be effective. Please help if someone has got a better approach.

回答1:

Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?

No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.

Well, technically you could implement List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.

So why to have an API which we can not even use (Or am I missing something).

collect is intended to be used for cases where only large RDDs are inputs or intermediate results, and the output is small enough. If that's not your case, use foreach or other actions instead.



回答2:

As you want to collect the Data in a Java Collection, the data has to collect on single JVM as the java collections won't be distributed. There is no way to get all data in collection by not getting data. The interpretation of problem space is wrong.



回答3:

collect and similar are not meant to be used in normal spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.

You need to keep your data inside of the rdd, and use rdd transformations and actions without ever taking the data out. Methods like collect which pull you data out of spark and onto your driver defeat the purpose and undo any advantage that spark might be providing since now you're processing all of your data on a single machine anyway.