Spark partitioning/cluster enforcing

Posted 2019-07-13 00:57

Question:

I will be working with a large number of files structured as follows:

/day/hour-min.txt.gz

for a total of 14 days. I will use a cluster of 90 nodes/workers.

I am reading everything with wholeTextFiles(), as it is the only way that lets me split the data appropriately. All the computations will be done on a per-minute basis (so basically per file), with a few reduce steps at the end. There are roughly 20,000 files; how do I partition them efficiently? Do I let Spark decide?

Ideally, I think each node should receive entire files; does Spark do that by default? Can I enforce it? How?
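
For concreteness, here is a minimal sketch of the pipeline I have in mind; the HDFS path and the per-minute function are placeholders, not my real code:

    from pyspark import SparkContext

    sc = SparkContext(appName="per-minute-jobs")

    # Hypothetical path matching the /day/hour-min.txt.gz layout.
    # wholeTextFiles yields one (path, content) record per file,
    # so each file stays intact.
    files = sc.wholeTextFiles("hdfs:///data/*/*.txt.gz")

    def per_minute_stats(path, content):
        # placeholder for the real per-minute computation
        return len(content.splitlines())

    stats = files.map(lambda pc: per_minute_stats(*pc))
    total = stats.reduce(lambda a, b: a + b)  # one of the final reduce steps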

Answer 1:

I think each node should receive entire files; does Spark do that by default?

Yes. Given that WholeTextFileRDD (what you get from sc.wholeTextFiles) has its own WholeTextFileInputFormat, which reads each whole file as a single record, you're covered. If your Spark executors and DataNodes are co-located, you can also expect node-local data locality. (You can check this in the Spark UI once your application is running.)
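
If you want to verify this yourself, here is a quick sketch (the path is an assumption): every element of the RDD is a complete (path, content) pair, so counting elements per partition counts whole files.

    rdd = sc.wholeTextFiles("hdfs:///data/*/*.txt.gz")

    # Each element is a (path, content) pair for one complete file,
    # so this prints how many whole files each partition holds.
    print(rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect())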

A word of caution, from a note in the Spark documentation for sc.wholeTextFiles:

Small files are preferred, large file is also allowable, but may cause bad performance.

Answer 2:

You could use the usual rule of thumb (roughly 2-4 partitions per core) for your partitions:

data = data.coalesce(total_cores * 3)  # total_cores: total executor cores in the cluster

Ideally, I think each node should receive entire files; does Spark do that by default? Can I enforce it? How?

It all depends on your RDD, not on your files. If you build an RDD that contains all the contents of the files, for example, then Spark will distribute that RDD, and whether a whole file lies on a single node depends on many parameters (the number of partitions, the size of each file, etc.).

I do not think you can enforce something like that, so focus on the number of partitions, which is critical.
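
To make that concrete, a small sketch under the assumptions of this question (the path and the 90 * 3 figure are placeholders, following the rule of thumb above):

    rdd = sc.wholeTextFiles("hdfs:///data/*/*.txt.gz")
    print(rdd.getNumPartitions())  # what Spark chose by default

    # Records are whole files, so repartitioning shuffles complete files
    # between partitions but never splits a single file apart.
    rdd = rdd.repartition(90 * 3)  # hypothetical: 90 workers * 3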


As for the number of files: as I had written on my pseudosite, too few files will result in huge files that may simply be too big, while too many files will force HDFS to maintain a huge amount of metadata, putting a lot of pressure on it.