Hadoop - Large files in distributed cache

Posted 2020-02-07 07:21

Question:

I have a 4 GB file that I am trying to share across all mappers through the distributed cache. But I am observing a significant delay before the map task attempts start. Specifically, there is a long delay between the time I submit my job (through job.waitForCompletion()) and the time the first map task starts.

I would like to know what the side effects of having large files in the DistributedCache are. How many times is a file on the distributed cache replicated? Does the number of nodes in the cluster have any effect on this?

(My cluster has about 13 nodes running on very powerful machines where each machine is able to host close to 10 map slots.)
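
For context, the driver attaches the file and submits the job roughly like this (the class name and paths are just placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "job-with-4gb-cache-file");
        job.setJarByClass(CacheFileDriver.class);

        // The 4 GB file every mapper needs; this path is a placeholder.
        // The file is localized onto each task node before tasks start.
        job.addCacheFile(new URI("/shared/lookup/huge-file.dat"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The submission call mentioned above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}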

Thanks

Answer 1:

"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.

For files larger than roughly 200 MB I usually put them directly into HDFS and set the replication to a higher value than the usual one (in your case I would set it to 5-7). You can then read straight from the distributed filesystem in every task with the usual FileSystem API, for example:

// "config" is the job's Configuration object
FileSystem fs = FileSystem.get(config);
// open() returns an FSDataInputStream the task can read from
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
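
To get the file into HDFS at that higher replication factor in the first place, the upload could look roughly like this (the class name, the paths and the exact factor are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadLargeFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths -- adjust to your environment.
        Path src = new Path("/local/path/to/the/larger/file");
        Path dst = new Path("/path/to/the/larger/file");

        // Upload once, then raise the replication factor (5-7 as suggested)
        // so that more nodes hold a local copy of the blocks.
        fs.copyFromLocalFile(src, dst);
        fs.setReplication(dst, (short) 7);

        fs.close();
    }
}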

Reading directly from HDFS like this saves space in the cluster, and it should not delay the task start either. However, if a read is not data-local, HDFS has to stream the data to the task over the network, which can use a considerable amount of bandwidth.
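
Inside a mapper, that direct HDFS read would typically happen once in setup(); a minimal sketch, where the class name, the path and the line handling are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LargeFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Placeholder path for the shared file that was uploaded to HDFS.
    private static final Path LARGE_FILE = new Path("/path/to/the/larger/file");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Stream the file straight from HDFS instead of relying on a
        // per-task localized copy from the distributed cache.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(LARGE_FILE)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Build whatever lookup structure the job actually needs here.
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the data prepared in setup(); here the input is just passed through.
        context.write(value, new Text(""));
    }
}

Doing the read in setup() means it happens once per task attempt rather than once per input record.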