I want to make my understanding about hadoop distributed cache clear. I know that when we add files to distributed cache, the files get loaded to the disk of every node in the cluster.
So how do the data of the files get transmitted to all the nodes in the cluster. Is it through the network? If so, will it not cause a strain on the network?
I have the following thoughts, are they correct?
If the files are large, wont there be network congestion?
If the number of nodes are large, even though the files are of medium or small size, the replication of the files and transmission to all nodes, wont it cause network congestion and memory constraints?
Please help me in understanding these concepts.
Thanks!!!
Yes the files are transferred via the network, usually via HDFS. It will cause no more strain on the network than using HDFS for anything that's a non data local task.
If the files are large, there is the possibility of network congestion, but you're already pushing your jar to all of these task trackers, so as long as your files are not too much bigger than your jar, your overhead shouldn't be too bad.
The replication of the files is completely separate from the number of task trackers that will eventually pull this file. The replication will be chained from node to node as well and will be the cost of having a fault tolerant distributed file system no matter what. Again, the network congestion is no more a problem than pushing your jar to all of the task trackers, assuming the files in the distributed cache are of equivalent size of your jars.
Overall the overhead of the distributed cache is minuscule as long as it is used as intended, as a way to push reasonably small cached data to be local to the task trackers doing the computation.
Edit: Here is the DistributedCache documentation for 0.20. Note that the files are specified via urls. Usually you would use something on your local hdfs:// setup.
I think what you understand for distributed cache are correct. Because I think so too :) Maybe increase the replication of distributed cache can reduce the network transfer